Foreword



This volume contains the papers presented at the 13th International Workshop 
on Languages and Compilers for Parallel Computing. It also contains extended 
abstracts of submissions that were accepted as posters. The workshop was held 
at the IBM T. J. Watson Research Center in Yorktown Heights, New York. 
As in previous years, the workshop focused on issues in optimizing compilers, 
languages, and software environments for high performance computing. This 
continues a trend in which languages, compilers, and software environments for 
high performance computing, and not strictly parallel computing, have been the
organizing topic. As in past years, participants came from Asia, North America, 
and Europe. 

This workshop reflected the work of many people. In particular, the members 
of the steering committee, David Padua, Alex Nicolau, Utpal Banerjee, and 
David Gelernter, have been instrumental in maintaining the focus and quality of 
the workshop since it was first held in 1988 in Urbana-Champaign. The assistance 
of the other members of the program committee - Larry Carter, Sid Chatterjee, 
Jeanne Ferrante, Jan Prins, Bill Pugh, and Chau-Wen Tseng - was crucial. The
infrastructure at the IBM T. J. Watson Research Center provided trouble-free 
logistical support. The IBM T. J. Watson Research Center also provided financial 
support by underwriting much of the expense of the workshop. Appreciation 
must also be extended to Marc Snir and Pratap Pattnaik of the IBM T. J. 
Watson Research Center for their support. 

Finally, we would like to thank the referees who spent countless hours as- 
sisting the program committee members in evaluating the quality of the sub- 
missions: Scott B. Baden, Jean-Francois Collard, Val Donaldson, Rudolf Eigen- 
mann, Stephen Fink, Kang Su Gatlin, Michael Hind, Francois Irigoin, Pramod 
G. Joisha, Gabriele Keller, Wolf Pfannenstiel, Lawrence Rauchwerger, Martin
Simons, D. B. Skillicorn, Hong Tang, and Hao Yu. 
Accurate Shape Analysis for Recursive Data Structures*



Francisco Corbera, Rafael Asenjo, and Emilio Zapata 

Dept. Computer Architecture, University of Malaga, Spain 
{corbera, asenjo, ezapata}@ac.uma.es



Abstract. Automatic parallelization of codes which use dynamic data 
structures is still a challenge. One of the first steps in such paralleliza- 
tion is the automatic detection of the dynamic data structure used in the 
code. In this paper we describe the framework and the compiler we have 
implemented to capture complex data structures generated, traversed, 
and modified in C codes. Our method assigns a Reduced Set of Refer- 
ence Shape Graphs (RSRSG) to each sentence to approximate the shape 
of the data structure after the execution of such a sentence. With the 
properties and operations that define the behavior of our RSRSG, the 
method can accurately detect complex recursive data structures such as 
a doubly linked list of pointers to trees where the leaves point to addi- 
tional lists. Other experiments are carried out with real codes to validate 
the capabilities of our compiler. 



1 Introduction 

For complex and time-consuming applications, parallel programming is a must. 
Automatic parallelizing compilers are designed with the aim of dramatically re- 
ducing the time needed to develop a parallel program by generating a parallel 
version from a sequential code without special annotations. There are several 
well-known research groups involved in the development and improvement of 
parallel compilers, such as Polaris, PFA, Parafrase, SUIF, etc. We have noted 
that the detection step of current parallelizing compilers does a pretty good job 
when dealing with regular or numeric codes. However, they cannot manage irreg- 
ular codes or symbolic ones, which are mainly based on complex data structures 
which use pointers in many cases. Actually, data dependence analysis is quite 
well known for array-based codes even when complex array access functions are 
present [?]. On the other hand, much less work has been done to successfully
determine the data dependencies of code sections using dynamic data structures 
based on pointers. Nevertheless, this is a problem that cannot be avoided due 
to the increasing use of dynamic structures and memory pointer references. 

* This work was supported by the Ministry of Education and Science (CICYT) of
Spain (TIC96-1125-C03), by the European Union (BRITE-EURAM III BE95-1564),
and by APART: Automatic Performance Analysis: Resources and Tools, EU Esprit IV
Working Group No. 29488.

S.P. Midkiff et al. (Eds.): LCPC 2000, LNCS 2017, pp. 1-E3 2001. 

(c) Springer-Verlag Berlin Heidelberg 2001
With this motivation, our goal is to propose and implement new techniques
that can be included in compilers to allow the automatic parallelization of real 
codes based on dynamic data structures. From this goal we have selected the 
shape analysis subproblem, which aims at estimating at compile time the shape 
the data will take at run time. Given this information, a subsequent analysis 
would detect whether or not certain sections of the code can be parallelized 
because they access independent data regions. 

There are several ways this problem can be approached, but we focus on the
graph-based methods in which the “storage chunks” are represented by nodes,
and edges are used to represent references between them [?], [?], [?]. In a
previous work [?], we combined and extended several ideas from these previous
graph-based methods, for example, allowing more than one summary node per
graph, among other extensions. However, we kept the restriction of one graph
per sentence in the code. This way, since each sentence of the code can be reached 
after following several paths in the control flow, the associated graph should ap- 
proximate all the possible memory configurations arising after the execution of 
this sentence. This restriction leads to memory and time saving, but at the same 
time it significantly reduces the accuracy of the method. In this work, we have 
changed our previous direction by selecting a tradeoff solution: we consider
several graphs with more than one summary node, while fulfilling some rules to avoid
an explosion in the number of graphs and nodes in each graph. 

Among the first relevant studies which allowed several graphs were those
developed by Jones et al. [?] and Horwitz et al. [?]. These approaches are based
on a “k-limited” approximation in which all nodes beyond a path of k selectors
are joined in a summary node. The main drawback of these methods is that the
node analysis beyond the “k-limit” is very inexact and therefore they are unable
to capture complex data structures. A more recent work that also allows several
graphs and summary nodes is the one presented by Sagiv et al. [?]. They propose
a parametric framework based on a 3-valued logic. To describe the memory
configuration they use 3-valued structures defined by several predicates. These
predicates determine the accuracy of the method. As far as we know, the currently
proposed predicates do not suffice to deal with the complex data structures that
we handle in this paper.

With this in mind, our proposal is based on approximating all the possible 
memory configurations that can arise after the execution of a sentence by a 
set of graphs: the Reduced Set of Reference Shape Graphs (RSRSG). We see 
that each RSRSG is a collection of Reference Shape Graphs (RSG) each one 
containing several non-compatible nodes. Finally, each node represents one or 
several memory locations. Compatible nodes are “summarized” into a single one.
Two nodes are compatible if they share the same reference properties. With this 
framework we can achieve accurate results without excessive compilation time. 
Besides this, we cover situations that were previously unsolved, such as detection 
of complex structures (lists of trees, lists of lists, etc.) and structure permutation, 
as we will see in this article. 
The rest of the paper is organized as follows. Section 2 briefly describes the
whole framework, introducing the key ideas of the method and presenting the
data structure example that will help in understanding node properties and
operations with graphs. These properties are described in Sect. 3, where we show
how the RSG can accurately approximate a memory configuration. The analysis
method has been implemented in a compiler which is experimentally validated
in Sect. 4 by analyzing several C codes based on complex data structures.
Finally, we summarize the main contributions and future work in Sect. 5.



2 Method Overview 

Basically, our method is based on approximating all possible memory configu- 
rations that can appear after the execution of a sentence in the code. Note that 
due to the control flow of the program, a sentence could be reached by following 
several paths in the control flow. Each “control path” has an associated mem- 
ory configuration which is modified by each sentence in the path. Therefore, a 
single sentence in the code modifies all the memory configurations associated 
with all the control paths reaching this sentence. Each memory configuration is 
approximated by a graph we call a Reference Shape Graph (RSG). So, taking all
this into account, we conclude that each sentence in the code will have a set of 
RSGs associated with it. This set of RSGs will describe the shape of the data 
structure after the execution of this sentence. 

The calculation of this set of graphs is carried out by the symbolic ex- 
ecution of the program over the graphs. In this way, each program sentence 
transforms the graphs to reflect the changes in the memory configurations de- 
rived from the sentence execution. The RSGs are graphs in which nodes repre- 
sent memory locations which have similar reference patterns. Therefore, a single 
node can safely and accurately represent several memory locations (if they are
similarly referenced) without losing their essential characteristics. 

To determine whether or not two memory locations should be represented by 
a single node, each one is annotated with a set of properties. Now, two different 
memory locations will be “summarized” in a single node if they fulfill the same 
properties. Note that the node inherits the properties of the memory locations 
represented by this node. Besides this, two nodes can be also summarized if 
they represent “summarizable” memory locations. This way, a possibly unlimited 
memory configuration can be represented by a limited size RSG, because the 
number of different nodes is limited by the number of properties of each node. 
These properties are related to the reference pattern used to access the memory 
locations represented by the node. Hence the name Reference Shape Graph. 

As we have said, all possible memory configurations which may arise after 
the execution of a sentence are approximated by a set of RSGs. We call this 
set Reduced Set of Reference Shape Graphs (RSRSG), since not all the different 
RSGs arising in each sentence will be kept. On the contrary, several RSGs related 
to different memory configurations will be fused when they represent memory 
locations with similar reference patterns. There are also several properties related 
to the RSGs, and two RSGs should share these properties to be joined. Therefore,
besides the number of nodes in an RSG, the number of different RSGs associated
with a sentence is limited too. This union of RSGs greatly reduces the number
of RSGs and leads to a practicable analysis.

The symbolic execution of the code consists in the abstract interpretation of 
each sentence in the code. This abstract interpretation is carried out iteratively 
for each sentence until we reach a fixed point in which the resulting RSRSG 
associated with the sentence does not change any more [?]. This way, for each
sentence that modifies dynamic structures, we have to define the abstract
semantics which describes how these sentences modify the RSRSG. We consider
six simple instructions that deal with pointers: x = NULL, x = malloc, x = y,
x->sel = NULL, x->sel = y, and x = y->sel. More complex pointer
instructions can be built upon these simple ones and temporary variables.
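
As an illustration of how a more complex pointer instruction can be reduced to
the simple ones, the following Python sketch lowers a store through a chain of
selectors into loads of the form x = y->sel followed by one store of the form
x->sel = y. The helper name lower_store and the temporaries _t1, _t2, ... are
hypothetical and are not taken from our compiler; the right-hand side is
assumed to be a plain pointer variable or NULL.

```python
from itertools import count


def lower_store(base, selectors, rhs):
    """Lower `base->s1->...->sk = rhs` into simple pointer statements.

    Hypothetical helper: each produced statement has one of the simple forms
    `x = y`, `x = y->sel` or `x->sel = y`; temporaries are named _t1, _t2, ...
    `rhs` is assumed to be a pointer variable name or the string "NULL".
    """
    if not selectors:
        return [f"{base} = {rhs}"]
    fresh = count(1)
    stmts = []
    cur = base
    # Every selector except the last is traversed with a load: tmp = cur->sel.
    for sel in selectors[:-1]:
        tmp = f"_t{next(fresh)}"
        stmts.append(f"{tmp} = {cur}->{sel}")
        cur = tmp
    # The last selector is written with a single store: cur->sel = rhs.
    stmts.append(f"{cur}->{selectors[-1]} = {rhs}")
    return stmts


# p->nxt->tree->lft = q
lowered = lower_store("p", ["nxt", "tree", "lft"], "q")
for s in lowered:
    print(s)
```

Each statement printed by this sketch is one of the six simple instructions, so
only their abstract semantics need to be defined.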

The output RSRSG resulting from the abstract interpretation of a sentence
over an input $RSRSG_i$ is generated by applying the abstract interpretation to
each $rsg_i \in RSRSG_i$. After the abstract interpretation of the sentence over
the $rsg_i \in RSRSG_i$ we obtain a set of output graphs $rsg_o$. As we said, we cannot
keep all the $rsg_o$ arising from the abstract interpretation. Instead, each
$rsg_o$ is compressed, which means that its compatible nodes are summarized.
Furthermore, some of the $rsg_o$ can be fused into a single RSG if
they represent similar memory configurations. This operation greatly reduces
the number of RSGs in the resulting RSRSG. In the worst case, the sequence
of operations that the compiler carries out in order to symbolically execute a
sentence is: graph division, graph pruning, sentence symbolic execution (RSG
modification), RSG compression, and RSG union to build the final RSRSG. Due
to space constraints we cannot formally describe these operations or the
abstract semantics carried out by the compiler. However, in order to provide
an overview of our method we present a data structure example which will be
referred to during the framework and operations description.
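
The iteration to a fixed point can be sketched in Python as follows. This is
only a toy stand-in and not our compiler: each "graph" is just the set of
pointer variables known to be non-NULL, there is no compression or union of
graphs, and the names fixed_point, cfg and transfer are assumptions made for
the example. What it does show is that every sentence ends up with a set of
graphs, and that the analysis stops once no set changes.

```python
def fixed_point(cfg, transfer, entry, init):
    """Iterate the abstract interpretation of each sentence to a fixed point.

    cfg:      dict sentence -> list of successor sentences
    transfer: dict sentence -> function(graph) -> iterable of output graphs
    entry:    the first sentence; init: graphs holding before it
    Returns a dict sentence -> frozenset of graphs holding after it.
    """
    preds = {s: [] for s in cfg}
    for s, succs in cfg.items():
        for t in succs:
            preds[t].append(s)
    out = {s: frozenset() for s in cfg}
    work = list(cfg)
    while work:
        s = work.pop()
        inputs = set(init) if s == entry else set()
        for p in preds[s]:
            inputs |= out[p]          # one input graph per incoming control path
        new = frozenset(g2 for g in inputs for g2 in transfer[s](g))
        if new != out[s]:             # something changed: revisit the successors
            out[s] = new
            work.extend(cfg[s])
    return out


# Toy graphs: frozenset of the pvars that are known to be non-NULL.
cfg = {"s1": ["s2", "s3"], "s2": ["s2", "s3"], "s3": []}
transfer = {
    "s1": lambda g: [g - {"x"}],                                  # x = NULL
    "s2": lambda g: [g | {"x"}],                                  # x = malloc
    "s3": lambda g: [(g | {"y"}) if "x" in g else (g - {"y"})],   # y = x
}
result = fixed_point(cfg, transfer, "s1", [frozenset()])
print(sorted(sorted(g) for g in result["s3"]))
```

In this toy run the last sentence keeps two graphs, one per kind of control
path reaching it, which mirrors the idea of associating a set of RSGs with each
sentence.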



2.1 Working Example 

The data structure, presented in Fig. 1(a), is a doubly linked list of pointers
to trees. Besides this, the leaves of the trees have pointers to doubly linked 
lists. The pointer variable S points to the first element of the doubly linked list 
(header list). Each item in this list has three pointers: nxt, prv, and tree. This 
tree selector points to the root of a binary tree in which each element has the 
lft and rgh selectors. Finally, the leaves of the trees point to additional doubly
linked lists. All the trees pointed to by the header list are independent and do 
not share any element. In the same way, the lists pointed to by the leaves of the 
same tree or different trees are also independent. 

This data structure is built by a C code which traverses the elements of the
header list with two pointers and may permute two trees. Our compiler
has analyzed this code, obtaining an RSRSG for each sentence in the program.
Figure 1(b) shows a compact representation of the RSRSG obtained for the last
sentence of the code after the compiler analysis.
Fig. 1. A complex data structure and compact representation of the resulting
RSRSG.



As we will see in the next sections, from the RSRSG represented in Fig. 1(b)
we can infer the actual properties of the real data structure: the trees and lists 
do not share elements and therefore they can be traversed in parallel. These 
results and other examples of real codes (sparse matrix-vector multiplication, 
sparse LU factorization and Barnes-Hut N-body simulation) with different data 
structures are presented in Sect. 4. But first, we describe our framework with a
formal description of the RSGs in the next section. 

3 Reference Shape Graph 

First, we need to present the notation used to describe the different memory 
configurations that may arise when executing a program. 

Definition 1 We call a collection of dynamic structures a memory configuration.
These structures comprise several memory chunks, that we call memory
locations, which are linked by references. Inside these memory locations there is
room for data and for pointers to other memory locations. These pointers are
called selectors.

We represent the memory configurations with the tuple $M = (L, P, S, PS, LS)$
where: $L$ is the set of memory locations; $P$ is the set of pointer variables (pvars)
used in the program; $S$ is the set of selectors declared in the data structures; $PS$
is the set of references from pvars to memory locations, of the type
$\langle pvar, l \rangle$, with $pvar \in P$ and $l \in L$; and $LS$ is the set of links
between memory locations, of the form $\langle l_1, sel, l_2 \rangle$ where $l_1 \in L$
references $l_2 \in L$ by selector $sel \in S$.

We will use L(m), P(m), S(m), PS(m), and LS(m) to refer to each one of 
these sets for a particular memory configuration, m. □ 

Therefore, we can assume that the RSRSG of a program sentence is an
approximation of the memory configuration, M, after the execution of this
sentence. But let us first describe the RSGs now that we know how a memory
configuration is defined.

Definition 2 An RSG is a graph represented by the tuple $RSG = (N, P, S, PL, NL)$
where: $N$ is the set of nodes. Each node can represent several memory
locations if they fulfill certain common properties; $P$ is the set of pointer variables
(pvars) used in the program; $S$ is the set of declared selectors; $PL$ is the set
of references from pvars to nodes, of the type $\langle pvar, n \rangle$ with $pvar \in P$ and
$n \in N$; and $NL$ is the set of links between nodes, of the type $\langle n_1, sel, n_2 \rangle$
where $n_1 \in N$ references $n_2 \in N$ by selector $sel \in S$.

We will use N(rsg), P(rsg), S(rsg), PL(rsg), and NL(rsg) to refer to each 
one of these sets for a particular RSG, rsg. □ 

To obtain the RSG which approximates the memory configuration, M, an
abstraction function is used, $F : M \rightarrow RSG$. This function maps memory
locations into nodes and references to memory locations into references to nodes
at the same time. In other words, F translates the memory domain into the
graph domain. This function F comprises three functions: $F_n : L \rightarrow N$ takes
care of mapping memory locations into nodes; $F_p : PS \rightarrow PL$ maps
references from pvars to memory locations into references from pvars to nodes; and
$F_l : LS \rightarrow NL$ maps links between locations into links between nodes.

It is easy to see that $F_p(\langle pvar, l \rangle) = \langle pvar, n \rangle$ iff $F_n(l) = n$, and
$F_l(\langle l_1, sel, l_2 \rangle) = \langle n_1, sel, n_2 \rangle$ iff $F_n(l_1) = n_1 \wedge F_n(l_2) = n_2$, which means
that translating references to locations into references to nodes is trivial after
mapping locations into nodes. This moves almost all the complexity involved
in function F to function $F_n$, which actually maps locations into nodes.
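
A minimal Python sketch of this observation follows: once the node mapping
$F_n$ is given (here as a plain dictionary), $F_p$ and $F_l$ reduce to renaming
the endpoints of each reference. The function name abstract, the tuple
encoding of PS and LS, and the particular $F_n$ chosen for the example are
assumptions made for illustration only.

```python
def abstract(PS, LS, Fn):
    """Apply F to a memory configuration, given the node mapping Fn.

    PS: set of (pvar, l) references; LS: set of (l1, sel, l2) links;
    Fn: dict mapping each memory location to its node.
    Returns (PL, NL): pvar-to-node references and node-to-node links.
    """
    PL = {(pvar, Fn[l]) for (pvar, l) in PS}               # F_p
    NL = {(Fn[l1], sel, Fn[l2]) for (l1, sel, l2) in LS}   # F_l
    return PL, NL


# A four-element singly linked list whose first element is pointed to by S.
PS = {("S", "l1")}
LS = {("l1", "nxt", "l2"), ("l2", "nxt", "l3"), ("l3", "nxt", "l4")}
# Hypothetical F_n: head n1, the two middle elements summarized in n2, tail n3.
Fn = {"l1": "n1", "l2": "n2", "l3": "n2", "l4": "n3"}

PL, NL = abstract(PS, LS, Fn)
print(sorted(PL))
print(sorted(NL))
```

The two middle locations collapse into n2, so the link between them becomes a
self-reference of n2 by selector nxt in the resulting graph.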

Now we focus on $F_n$, which extracts some important properties from a memory
location and, depending on these, translates this location into a node.
Besides this, if several memory locations share the same properties then this
function maps all of them into the same node of the RSG. Due to this dependence
on the location properties, the $F_n$ function will be described during the
presentation of the different properties which characterize each node. These
properties are: Type, Structure, Simple Paths, Reference pattern, Share information,
and Cycle links. These are now described.



3.1 Type 

This property tries to extract information from the code text. The idea is that 
two pointers of different types should not point to the same memory location 
(for example, a pointer to a graph and another to a list). Also, the memory 
location pointed to by a graph pointer should be considered as a graph. This 
way, we assign the TYPE property of a memory location $l$ from the type of the
pointer variable used when that memory location $l$ is created (by malloc or in
the declarative section of the code).

Therefore, two memory locations, $l_1$ and $l_2$, can be mapped into the same
node only if they share the same TYPE value: if $F_n(l_1) = F_n(l_2) = n$ then
$TYPE(l_1) = TYPE(l_2)$, where the TYPE() function returns the TYPE value. Note that we can
use the same property name and function, TYPE(), for both memory locations
and nodes. Clearly, $TYPE(n) = TYPE(l)$ when $F_n(l) = n$.

This property leads to the situation where, for the data structure presented in
Fig. 1(a), the nodes representing list memory locations will not be summarized
with those nodes representing tree locations, as we can see in Fig. 1(b).

3.2 Structure 

As we have just seen, the TYPE property keeps two different nodes for two memory 
locations of different types. However, we also want to avoid the summarization 
of two nodes which represent memory locations of the same type but which 
do not share any element. That is, they are non-connected components. This 
behavior is achieved by the use of the STRUCTURE property, which takes the 
same value for all memory locations (and nodes) belonging to the same connected 
component. Again, two memory locations can be represented by the same node only if
they have the same STRUCTURE value: if $F_n(l_a) = F_n(l_b) = n$ then
$STRUCTURE(l_a) = STRUCTURE(l_b)$.

This leads to the condition that two locations must fulfill in order to share
the same STRUCTURE value: $STRUCTURE(l_a) = STRUCTURE(l_b) = val$ iff
$\exists l_1, \ldots, l_i \mid (\langle l_a, sel_1, l_1 \rangle, \langle l_1, sel_2, l_2 \rangle, \ldots, \langle l_i, sel_{i+1}, l_b \rangle \in LS) \vee (\langle l_b, sel_1, l_1 \rangle, \langle l_1, sel_2, l_2 \rangle, \ldots, \langle l_i, sel_{i+1}, l_a \rangle \in LS)$,
which means that two memory locations, $l_a$ and $l_b$, have the same STRUCTURE
value if there is a path from $l_a$ to $l_b$ (first part of the previous equation)
or from $l_b$ to $l_a$ (second part of the equation).
In the same way we can define STRUCTURE(n).
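
One possible way to compute such a value on a concrete memory configuration is
sketched below in Python with a small union-find. The sketch follows the
"connected component" reading, that is, links are treated as undirected, so two
leaves of the same tree receive the same value even though no directed path
joins them. The function name structure and the use of a representative
location as the STRUCTURE value are assumptions of this example.

```python
def structure(L, LS):
    """Map each location to an identifier of its connected component.

    L: iterable of locations; LS: set of (l1, sel, l2) links.
    Links are treated as undirected (an assumption of this sketch).
    """
    parent = {l: l for l in L}

    def find(l):
        # Follow parents up to the representative, halving the path on the way.
        while parent[l] != l:
            parent[l] = parent[parent[l]]
            l = parent[l]
        return l

    for (l1, _sel, l2) in LS:
        r1, r2 = find(l1), find(l2)
        if r1 != r2:
            parent[r1] = r2   # merge the two components
    return {l: find(l) for l in L}


# Two independent trees: r1 with leaves a and b, r2 with leaf c.
L = ["r1", "a", "b", "r2", "c"]
LS = {("r1", "lft", "a"), ("r1", "rgh", "b"), ("r2", "lft", "c")}
s = structure(L, LS)
print(s["a"] == s["b"], s["a"] == s["c"])
```

Locations a and b share a STRUCTURE value and could therefore be summarized
(subject to the other properties), whereas a and c could not.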

3.3 Simple Paths 

The SPATH property further restricts the summarizing of nodes. A simple path
denotes the access path from a pointer variable (pvar) to a location or node.
An example of a simple path is p -> s, in which the pvar p points to the location
s. In this example the simple path for s is $\langle p \rangle$. The use of the simple path
avoids the summarization of nodes which are directly pointed to by the pvars
(actually, these nodes are the entry points to the data structure). We define the
SPATH property for a memory location $l \in L(m)$ as $SPATH(l) = \{p_1, \ldots, p_n\}$ where
$\langle p_i, l \rangle \in PS(m)$. This property is similarly defined for the RSG domain. Now,
two memory locations are represented by the same node only if they have the same
SPATH (if $F_n(l_1) = F_n(l_2)$ then $SPATH(l_1) = SPATH(l_2)$).

3.4 Reference Patterns 

This new property is introduced to classify and represent by different nodes the 
memory locations with different reference patterns. We understand by reference 
pattern the type of selectors which point to a certain memory location and which 
point from this memory location to others. 
This is particularly useful for keeping singular memory locations of the data
structure in separate nodes. For example, the head/tail and the rest of the
elements of a singly linked list are two kinds of memory locations. These will be
represented by different nodes, because the head location is not referenced by
other list entries and the tail location does not reference any other list
location. The same would also happen for more complex data structures built upon
simpler structures (such as lists of lists, trees of lists, etc.). For example,
in Fig. 1(a), the root of one of the trees is referenced by the header list and
the leaves do not point to tree items but to a doubly linked list. Thanks to the
reference patterns, the method results in the RSRSG of Fig. 1(b), where the
root of the tree, the leaves, and the other tree items are clearly identified.

In order to obtain this behavior, we define two sets, SELINset and SELOUTset,
which contain the input and output selectors for a location:
$SELINset(l_1) = \{sel_i \in S \mid \exists l_2 \in L, \langle l_2, sel_i, l_1 \rangle \in LS\}$ and
$SELOUTset(l_1) = \{sel_i \in S \mid \exists l_2 \in L, \langle l_1, sel_i, l_2 \rangle \in LS\}$,
where we see that $sel_i$ is in $SELINset(l_1)$ if $l_1$ is
referenced from somewhere by selector $sel_i$, and $sel_i$ is in $SELOUTset(l_1)$ if
$l_1.sel_i$ points to somewhere outside.
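
These two sets translate almost directly into code. The Python sketch below
computes them for the locations of a three-element doubly linked list; the
function names selin and selout and the tuple encoding of LS are assumptions of
the example.

```python
def selin(l, LS):
    """Selectors through which location l is referenced (SELINset)."""
    return {sel for (_src, sel, dst) in LS if dst == l}


def selout(l, LS):
    """Selectors through which location l references others (SELOUTset)."""
    return {sel for (src, sel, _dst) in LS if src == l}


# Doubly linked list: head h, middle m, tail t.
LS = {
    ("h", "nxt", "m"), ("m", "prv", "h"),
    ("m", "nxt", "t"), ("t", "prv", "m"),
}
for loc in ("h", "m", "t"):
    print(loc, sorted(selin(loc, LS)), sorted(selout(loc, LS)))
```

The head, the middle element, and the tail obtain three different reference
patterns, which is why they are kept in separate nodes.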

3.5 Share Information 

This is a key property for informing the compiler about the potential parallelism 
exhibited by the analyzed data structure. Actually, the share information can 
tell whether at least one of the locations represented by a node is referenced 
more than once from other memory locations. That is, a shared node represents 
memory locations which can be accessed from several places and this may prevent 
the parallelization of the code section which traverses these memory locations. 
From another point of view, this property helps us to determine if a cycle in 
the RSG graph is representing cycles in the data structure approximated by this 
RSG or not. 

Due to the relevance of this property, we use two kinds of attributes for each
node: (i) SHARED(n), with $n \in N(rsg)$, is a Boolean function which returns “true”
if any of the locations, $l_1$, represented by n is referenced by other locations, $l_2$
and $l_3$, by different selectors, $sel_i$ and $sel_j$. Therefore, this SHARED function tells
us whether there may be a cycle in the data structure represented by the RSG
or not. If SHARED(n) is 0, we know that even if we reach the node n by $sel_1$ and
later by $sel_2$, we are actually reaching two different memory locations represented
by the same n node, and therefore there is no cycle in the approximated data
structure. (ii) SHSEL(n, sel), with $n \in N(rsg)$ and $sel \in S$, is a Boolean function
which returns “true” if any of the memory locations, $l_1$, represented by n can
be referenced more than once by selector sel from other locations, $l_2$ and $l_3$.
This way, with the SHSEL function, we can distinguish two different situations
that can be represented by an RSG with a node, n, and a selector, sel, pointing
to itself. If SHSEL(n, sel) = 0 we know that this node is representing an acyclic
unbounded data structure (the size is not known at compile time). For example,
in a list, all the elements of the list (locations) are represented by the same node,
n, but following selector sel we always reach a different memory location. On the
other hand, if SHSEL(n, sel) = 1, for the same list example, by following selector
sel we can reach an already visited location, which means that there are cycles
in the data structure.
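
The two attributes can be sketched in Python over a concrete memory
configuration and a given node mapping. This is a simplified reading of the
definitions above, not our compiler's code: a location counts as shared when it
is referenced through at least two different selectors, and SHSEL holds for a
selector when at least two distinct locations reference the same location
through it. The names SHARED, SHSEL, shared_loc and shsel_loc, and the
dictionary Fn, are assumptions of the example.

```python
def shared_loc(l, LS):
    """True if location l is referenced through two or more different selectors."""
    return len({sel for (_src, sel, dst) in LS if dst == l}) >= 2


def shsel_loc(l, sel, LS):
    """True if l is referenced through `sel` from two or more locations."""
    return len({src for (src, s, dst) in LS if dst == l and s == sel}) >= 2


def SHARED(n, Fn, LS):
    # A node is shared if any location it represents is shared.
    return any(shared_loc(l, LS) for l, node in Fn.items() if node == n)


def SHSEL(n, sel, Fn, LS):
    return any(shsel_loc(l, sel, LS) for l, node in Fn.items() if node == n)


# Doubly linked list h <-> m1 <-> m2 <-> t, middle elements summarized in n2.
dll = {
    ("h", "nxt", "m1"), ("m1", "prv", "h"),
    ("m1", "nxt", "m2"), ("m2", "prv", "m1"),
    ("m2", "nxt", "t"), ("t", "prv", "m2"),
}
Fn = {"h": "n1", "m1": "n2", "m2": "n2", "t": "n3"}
print(SHARED("n2", Fn, dll), SHSEL("n2", "nxt", Fn, dll))

# Singly linked list with a cycle: c points back to b.
cyc = {("a", "nxt", "b"), ("b", "nxt", "c"), ("c", "nxt", "b")}
Fn2 = {"a": "n", "b": "n", "c": "n"}
print(SHSEL("n", "nxt", Fn2, cyc))
```

For the doubly linked list the middle node is shared but no selector is, so
following nxt never revisits a location; in the cyclic list SHSEL is true for
nxt, which signals a real cycle.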

Let’s illustrate these SHARED and SHSEL properties using the compact
representation of the RSRSG presented in Fig. 1(b). In this figure, shaded nodes have
the SHARED property set to true. For example, in the header list the middle node
$n_2$ is shared, SHARED($n_2$) = 1, because $n_2$ is referenced by selectors nxt and prv.
However, SHSEL($n_2$, nxt) = SHSEL($n_2$, prv) = 0, which means that by following
selector nxt or prv it is not possible to reach an already visited memory location.
Actually, in this example, there are no selectors with the SHSEL property set to
true. The same happens for node $n_8$, which represents the middle items of the
doubly linked lists.

We can also see in Fig. 1(b) that node $n_4$ is not shared, which states that, in
fact, from memory locations represented by $n_1$, $n_2$, and $n_3$ we can reach different
trees which do not share any elements (as we see in Fig. 1(a)). Finally, node
$n_7$ is shared because it is pointed to by selectors list and prv. However, due to
SHSEL($n_7$, list) = 0 we can ensure that two different leaves of the trees will never
point to the same doubly linked list.



3.6 Cycle Links 

The goal of this property is to increase the accuracy of the data structure rep- 
resentation by avoiding unnecessary edges that can appear during the RSG up- 
dating process. 

The cycle links of a node, n, are defined as the set of pairs of references
$\langle sel_i, sel_j \rangle$ such that when starting at node n and consecutively following
selectors $sel_i$ and $sel_j$, the n node is reached again. More precisely, for
$n \in N(rsg)$ we define: $CYCLELINKS(n) = \{\langle sel_i, sel_j \rangle \mid sel_i, sel_j \in S\}$, such that if
$\langle sel_i, sel_j \rangle \in CYCLELINKS(n)$ then: $\forall l_i, F_n(l_i) = n$, if $\langle l_i, sel_i, l_j \rangle \in LS$ then
$\exists \langle l_j, sel_j, l_i \rangle \in LS$.

This CYCLELINKS set maintains similar information to that of “identity paths”
in the Abstract Storage Graph (ASG) [?], which is very useful for dealing
with doubly linked structures. For example, in the data structure presented in
Fig. 1(a), the elements in the middle of the doubly linked lists have two cycle
links: $\langle nxt, prv \rangle$ and $\langle prv, nxt \rangle$, because starting at a list item and
consecutively following selectors nxt and prv (or prv and nxt) the starting item is
reached. Note that this does not apply to the first or last element of the doubly
linked list. This property is captured in the RSRSG shown in Fig. 1(b), where
we see three nodes for the doubly linked lists (one for the first element of the
list, another for the last element, and another between them to represent the
middle items in the list). This middle node, $n_8$, is annotated by our compiler
with CYCLELINKS($n_8$) = $\{\langle nxt, prv \rangle, \langle prv, nxt \rangle\}$.
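
The cycle links of a single memory location can be computed as in the Python
sketch below. It assumes, in addition to the definition above, that a pair
$\langle sel_i, sel_j \rangle$ is only reported when the location actually has an
outgoing $sel_i$ link, which avoids vacuously true pairs; the function name
cyclelinks and the tuple encoding of LS are also assumptions of the example.

```python
def cyclelinks(l, LS):
    """Pairs (sel_i, sel_j) such that l.sel_i.sel_j leads back to l.

    Only selectors that l actually uses are considered as sel_i
    (an assumption of this sketch).
    """
    # Group the targets of l by outgoing selector.
    out = {}
    for (src, sel, dst) in LS:
        if src == l:
            out.setdefault(sel, set()).add(dst)
    selectors = {sel for (_src, sel, _dst) in LS}
    links = set()
    for sel_i, targets in out.items():
        for sel_j in selectors:
            # Every sel_i target must point back to l through sel_j.
            if all((lj, sel_j, l) in LS for lj in targets):
                links.add((sel_i, sel_j))
    return links


# Doubly linked list: head h, middle m, tail t.
LS = {
    ("h", "nxt", "m"), ("m", "prv", "h"),
    ("m", "nxt", "t"), ("t", "prv", "m"),
}
for loc in ("h", "m", "t"):
    print(loc, sorted(cyclelinks(loc, LS)))
```

Only the middle element obtains both pairs, in agreement with the annotation of
the middle node described above.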

We conclude here that the CYCLELINKS property is used during the pruning
process which takes place after the node materialization and RSG modification.
So, in contrast with the other five properties described in previous subsections,
the CYCLELINKS property does not prevent the summarization of two nodes which
do not share the same CYCLELINKS sets, and therefore it does not affect the $F_n$ function.

3.7 Compression of Graphs 

After the symbolic execution of a sentence over an input RSRSG, the resulting 
RSRSG may contain RSGs with redundant information, which can be removed 
due to node summarization or compression of the RSG. 

In order to do this, after the symbolic execution of a sentence, the method
applies the COMPRESS function over the just modified RSGs. This COMPRESS function
first calls the Boolean function C_NODES_RSG, which identifies the compatible
nodes that will later be summarized. This Boolean function just has to check
whether or not the first five properties previously described are the same for
both nodes (as we said in the previous subsection, the CYCLELINKS property does
not affect the compatibility of two nodes).

There is a similar function which returns true when two memory locations are 
compatible. With this, we can finally define $F_n$ as the function which maps all
the compatible memory locations into the same node, which happens when they 
have the same TYPE, STRUCTURE, SPATH and SHARED properties, and compatible 
reference patterns. 
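
A possible reading of this definition of $F_n$ is sketched below in Python:
locations are grouped by a key built from their properties, and every group
becomes one node. The sketch simplifies in two labeled ways: "compatible
reference patterns" is taken to mean equal SELINset and SELOUTset, and the TYPE
and STRUCTURE values are supplied as precomputed dictionaries. The names
node_key and summarize are hypothetical.

```python
from collections import defaultdict


def node_key(l, TYPE, STRUCTURE, PS, LS):
    """Properties that must coincide for two locations to be compatible."""
    spath = frozenset(p for (p, loc) in PS if loc == l)
    selin = frozenset(s for (_a, s, b) in LS if b == l)
    selout = frozenset(s for (a, s, _b) in LS if a == l)
    shared = len(selin) >= 2   # referenced through different selectors
    return (TYPE[l], STRUCTURE[l], spath, shared, selin, selout)


def summarize(L, TYPE, STRUCTURE, PS, LS):
    """Group compatible locations; each group plays the role of one node."""
    groups = defaultdict(set)
    for l in L:
        groups[node_key(l, TYPE, STRUCTURE, PS, LS)].add(l)
    return sorted(sorted(g) for g in groups.values())


# A five-element singly linked list whose head is pointed to by S.
L = ["l1", "l2", "l3", "l4", "l5"]
PS = {("S", "l1")}
LS = {("l1", "nxt", "l2"), ("l2", "nxt", "l3"),
      ("l3", "nxt", "l4"), ("l4", "nxt", "l5")}
TYPE = {l: "list" for l in L}
STRUCTURE = {l: 0 for l in L}

nodes = summarize(L, TYPE, STRUCTURE, PS, LS)
print(nodes)
```

The five locations are represented by three nodes (head, middle, tail), no
matter how long the list is.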

4 Experimental Results 

All the previously mentioned operations and properties have been implemented
in a compiler written in C which analyzes a C code to generate the RSRSG
associated with each sentence of the code. As we said, the symbolic execution of
each sentence over an RSRSG generates a modified RSRSG. Before the
symbolic execution of the code, the compiler can also extract some important
information from the program in a previous pass. For example, a quite frequent
pattern arising in C codes based on dynamic data structures is the following:
while (x != NULL) { ... }.

In this case the compiler can assert that at the entry of the while body the
pvar $x \neq NULL$. Besides this, if we have not exited the while body with a break
sentence, we can also ensure that just after the while body the pvar $x = NULL$.
This information is used to simplify the analysis and increase the accuracy of the 
method. More precisely, we can reduce the number of RSGs and/or reduce the 
complexity of this RSG by diminishing the number of memory configurations 
represented by each RSG. Other sentences from which we can extract useful 
information are IF-THEN-ELSE, FOR loops, or any conditional sentence. 

The implementation of this idea has been carried out by the definition of
certain pseudoinstructions that we call FORCE. These pseudoinstructions are
inserted in the code by the first pass of the compiler and will be symbolically
executed as regular sentences. Therefore, each one of these FORCE sentences has
its own abstract semantics and its own associated RSRSG. The FORCE
pseudoinstructions we have considered are: $FORCE_{[x==NULL]}(rsg)$, $FORCE_{[x!=NULL]}(rsg)$,
$FORCE_{[x==y]}(rsg)$, $FORCE_{[x!=y]}(rsg)$, and $FORCE_{[x \rightarrow sel==NULL]}(rsg)$.

In addition, we have found that the FORCE[x!=NULL] pseudoinstruction can
also be placed just before the following sentences: x->sel = NULL, x->sel = y,
y = x->sel, and before any sentence with an occurrence of the type x->val, where
val is a non-pointer field, under the assumption that the code is correct. That is,
it makes sense to assume that before the execution of all these sentences, x
is not NULL (otherwise the code would produce an error at execution time).
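
The while-loop rule described above can be sketched as a small rewriting pass. The following Python sketch is illustrative only: the toy `Stmt`/`While` representation, the function names, and the textual `FORCE[...]` form (written without its RSG argument) are assumptions made for this example and are not the authors' compiler. It handles only loops of the form `while (x != NULL)` and only recognizes a `break` that appears directly in the loop body.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Stmt:
    text: str  # an opaque sentence, e.g. "x = x->nxt" or "break"


@dataclass
class While:
    var: str  # models a loop of the form: while (var != NULL) { body }
    body: List["Node"] = field(default_factory=list)


Node = Union[Stmt, While]


def contains_break(body: List[Node]) -> bool:
    # Only a break directly in this body exits this loop;
    # breaks inside nested loops belong to those loops.
    return any(isinstance(n, Stmt) and n.text == "break" for n in body)


def insert_force(nodes: List[Node]) -> List[Node]:
    """Return a copy of the program with FORCE pseudoinstructions added."""
    out: List[Node] = []
    for n in nodes:
        if isinstance(n, While):
            # At the entry of the while body the condition holds.
            body = [Stmt(f"FORCE[{n.var} != NULL]")] + insert_force(n.body)
            out.append(While(n.var, body))
            # Without a break, the loop can only exit when var is NULL.
            if not contains_break(n.body):
                out.append(Stmt(f"FORCE[{n.var} == NULL]"))
        else:
            out.append(n)
    return out


program = [While("x", [Stmt("x = x->nxt")])]
rewritten = insert_force(program)
```

In this sketch the pseudoinstructions are ordinary sentences in the rewritten program, which mirrors the statement above that they are symbolically executed like regular sentences.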

With the compiler described we have analyzed several codes that generate,
traverse, and modify dynamic data structures: the working example presented
earlier, the sparse matrix-vector multiplication, the sparse LU factorization, and the
Barnes-Hut code. All these codes were analyzed by our compiler on a Pentium
III 500 MHz with 128 MBytes of main memory. The time and memory required by
the compiler are summarized in Table 1. The particular aspects of these codes
are described next.

Table 1. Time and space required by the compiler to analyze several codes

        Working Example   S. Matrix-vector   S. LU factorization   Barnes-Hut
Time    0'13"             0'04"              1'38"                 4'04"
Space   2.7 MB            1.6 MB             24 MB                 47 MB

4.1 Working Example’s RSRSG 

We refer in this subsection to the code that generates, traverses, and modifies the
data structure presented in Fig. 1(a). A compact representation of the resulting
RSRSG for the last sentence of the code can be seen in Fig. 1(b). Although we
do not show the code due to space constraints, it
presents an additional difficulty because some tree permutations are carried out
during data structure modification. The problem arising during structure
permutation is that it is very easy to temporarily assign the SHARED=true property
to the root of one of the trees being permuted, when this root is
temporarily pointed to by two different locations from the header list. If this shared
property remained true after the permutation, we would have a shaded node
in Fig. 1(b). This would imply that two different items from the header list can
point to the same tree (which would prevent traversing the trees in parallel).
However, this problem is solved because, after the permutation, the
method reassigns false to the shared property thanks to the combination of our
properties and the division of graph operations. In summary, after the compiler
analyzes this code, the compact representation of the resulting RSRSG for the
last sentence of the program (Fig. 1(b)) accurately describes the data structure
depicted in Fig. 1(a) in the sense that: (i) the compiler successfully detects the
doubly linked list, which is acyclic by selectors nxt or prv and whose elements
point to binary trees; (ii) as SHSEL($n_4$, tree)=0, we can say that two different
items of the header list cannot point to the same tree; (iii) at the same time,
as no tree node ($n_4$, $n_5$, and $n_6$) is shared, we can say that different trees do not
share items; (iv) the same happens for the doubly linked lists pointed to by the
tree leaves: all the lists are independent, there are no two leaves pointing to the
same list, and these lists are acyclic by selectors nxt or prv.

Besides this, our compiler has also analyzed three C program kernels which 
generate, traverse, and modify complex dynamic data structures which we de- 
scribe next. 

4.2 Sparse Matrix-Vector Multiplication

Here we deal with an irregular code which implements a sparse matrix by vector
multiplication, r = M × v. The sparse matrix, M, is stored in memory as a
header doubly linked list with pointers to other doubly linked lists which store
the matrix rows. The sparse vectors, v and r, are also doubly linked lists. This
can be seen in Fig. 2(a). Note that vector r grows during the multiplication
process.




Fig. 2. Sparse matrix-vector multiplication data structure and compacted 
RSRSG. 

After the analysis process carried out by our compiler, the resulting RSRSG
accurately represents this data structure. In Fig. 2(b) we present a
compact representation of the resulting RSRSG for the last sentence of the code.
First, note that the three structures involved in this code are kept in separate
subgraphs. Even though the data type for vectors v and r and the rows of M is the
same, the STRUCTURE property prevents the union of these graphs into a single one.
This RSRSG states that the rows of the matrix are pointed to from different 
elements of the header list (there is no selector with the shared property set to 
true). Also, the doubly linked lists which store the rows of M and the vectors v 
and r are acyclic by selectors nxt and prv. 

The same RSRSG is also reached just before the execution of the outermost 
loop which takes care of the matrix multiplication, but without the r subgraph 
which is generated during this multiplication. 

4.3 Sparse LU Factorization

The kernel of many computer-assisted scientific applications consists of solving large
sparse linear systems. The code we analyze now solves non-symmetric sparse
linear systems by applying the LU factorization of the sparse matrix, computed
by using a general method. In particular, we have implemented an in-place code
for the direct right-looking LU algorithm, where an n-by-n matrix A is factorized.
The code includes a column pivoting operation (partial pivoting) to provide
numerical stability and preserve sparseness. The columns of the input matrix A, as well
as the resulting in-place LU columns, are stored in one-dimensional doubly linked
lists (see Fig. 3(a)), to facilitate the insertion of new entries and to allow column
permutations.





Fig. 3. Sparse LU factorization data structure and compacted RSRSG. 

After the LU code analysis we obtain the same RSRSG for the sentences
just after the matrix initialization and after the LU factorization. A compact
representation of this RSRSG is shown in Fig. 3(b). As we can see, variable A
points to a doubly linked list, the header list, and each node of this list points
to a single doubly linked list which represents a matrix column. The properties
of the data structure represented by this RSRSG can be inferred following the
same arguments we presented in the previous subsection.



4.4 Barnes-Hut N-body Simulation 

This code is based on the algorithm presented in P which is used in astrophysics. 
The data structure used in this code is based on a hierarchical octree representa- 
tion of space in three dimensions. In two dimensions, a quadtree representation is 
used. However, due to memory constraints (the octree and the quadtree versions
run out of memory on our 128 MB Pentium III) we have simplified the code to
use a binary tree. In Fig. 4(a) we present a schematic view of the data structure
used in this code. The bodies are stored in a singly linked list pointed to by the
pvar Lbodies.

Fig. 4. Barnes-Hut data structure and compacted RSRSG.

After the code analysis, the compact representation of the RSRSG at the end
of each step of the algorithm is presented in Fig. 4(b). We can see that the root
of the tree is represented by node $n_1$, the middle elements of the tree by node $n_2$,
and the leaves by $n_3$. Note that these leaves can point to any body stored in the
Lbodies list represented by nodes $n_4$, $n_5$, and $n_6$. As tree nodes are not shared
and selectors also have the SHSEL property set to false, a subsequent analysis of
the code can state that the tree can be traversed and updated in parallel. This
analysis can also conclude that there are no two different leaves pointing to the
same body (entry in the Lbodies list) because nodes $n_4$, $n_5$, and $n_6$ are not
shared by selector body.

5 Conclusions and Future Work 

We have developed a compiler which can analyze a C code to determine the
RSRSG associated with each sentence of the code. Each RSRSG contains several 
RSGs, each one representing the different data structures which may arise after 
following different paths in the control flow graph of the code. However, several 
RSGs can be joined if they represent similar data structures, in this way reducing 
the number of RSGs associated with a sentence. Every RSG contains nodes which 
represent one or several memory locations. To avoid an explosion in the number 
of nodes, all the memory locations which are similarly referenced are represented 
by the same node. This reference similarity is captured by the properties we 
assign to the memory locations. In comparison with previous works, we have 
increased the number of properties assigned to each node. This leads to more 
nodes in the RSG because the nodes now have to fulfill more properties to be 
summarized. However, by avoiding the summarization of these nodes, we keep 
a more accurate representation of the data structure. This is a key issue when 
analyzing the parallelism exhibited by a code. 

Our compiler symbolically executes each sentence in the code, transforming 
the RSGs to reflect the modifications in the data structure that are carried out 
due to the execution of the sentence. We have validated the compiler with several
C codes which generate, traverse, and modify complex dynamic data structures,
such as a doubly linked list of pointers to trees where the leaves point to other
doubly linked lists. Other structures have also been accurately identified by the
compiler, even in the presence of structure permutations (for example, column 
permutations in the sparse LU code). As far as we know, there is no compiler 
achieving such successful results for these kinds of data structures appearing in 
real codes. 

In the near future we will develop an additional compiler pass that will 
automatically analyze the RSRSGs and the code to determine the parallel loops 
of the program and allow the automatic generation of parallel code. 

Abstract. The Abstract Parallel Machine (APM) model separates the
definitions of parallel operations from the application algorithm, which 
defines the sequence of parallel operations to be executed. An APM con- 
tains a set of parallel operation definitions, which specify how the com- 
putation is organized into independent sites of computation and what 
data exchanges are required. This paper adds explicit cost models as the 
third component of an APM system. The costs of parallel operations 
can be obtained either by analyzing a parallel operation definition, or by 
measuring performance on a real machine. Costs with monotonicity con- 
straints allow the cost of an algorithm to be transformed automatically 
as the algorithm itself is transformed. 



1 Introduction 

There is increasing recognition of the fundamental role that cost models play
in the design of parallel programs. They enable the time and space
complexity of a program to be determined, as for sequential algorithms, but parallel
cost models serve several additional purposes. For example, intuition is often an 
inadequate basis for making the right choices about organizing the parallelism 
and using the system’s resources. Suitable high level cost models allow the pro- 
grammer to assess each alternative quantitatively during the design process, 
improving efficiency without requiring an inordinate amount of programming 
time. Portability of the efficiency is one of the chief problems in parallel pro- 
gramming, and cost models can help here by indicating where an algorithm 
should be modified to make effective use of a particular machine’s capabilities. 
Such motivations have led to a plethora of approaches to cost modeling. 

APMs (abstract parallel machines) are an approach for describing
parallel programming models, especially in the context of program
transformation. In this approach the parallel behavior is encapsulated in a set of ParOps
(parallel operations), which are analogous to combinators in data parallel
programming and skeletons in BMF. An explicit separation is made
between the definitions of the ParOps and the specific parallel algorithm to be

S.P. Midkiff et al. (Eds.): LCPC 2000, LNCS 2017, pp. 16-03 2001. 

(c) Springer-Verlag Berlin Heidelberg 2001 
implemented. An APM consists of a set of ParOps and a coordination language;
algorithms are built up from the ParOps of one APM and are expressed using a 
coordination language for that APM. APMs are not meant as programming lan- 
guages; rather, they illustrate programming models and their relationships, and 
provide a basis for algorithm transformation. The relationships between different 
parallel operations can be clarified with a hierarchy of related APMs. 

There is some notion of costs already inherent in the definition of an APM, 
since the parallel operation definitions state how the operation is organized into 
parallel sites and what communications are required. The cost of an algorithm is 
determined by the costs of the ParOps it uses, and the cost of a ParOp could be 
derived from its internal description. This would allow a derivation to be based 
only on the information inside the APM definition. However, this is not the only 
way to obtain costs, and a more general and explicit treatment of costs can be 
useful. 

In this paper, we enrich the APM approach by adding an explicit hierarchy 
of cost models. Every APM can be associated with one or more cost models, 
reflecting the possibility that the APM could be realized on different machines. 
The cost models are related to each other, following the connections in the 
APM hierarchy. Each cost model gives the cost of every ParOp within an APM. 
There are several reasonable ways to assign a cost to a parallel operation: it 
could be inferred from the internal structure (using the organization into sites, 
communications and data dependencies); it could be obtained by transforming 
mathematically the cost of the corresponding operation in a related APM; it 
could be determined by measuring the real cost for a specific implementation. 

The goal is to support the transformation of an algorithm from one APM
to another in a way that automatically yields the new costs. Such a cost transformation
could be used in several ways. The costs could guide the transformation of an al- 
gorithm through the APM hierarchy, from an abstract specification to a concrete 
realization. If the costs for an APM were obtained by measuring performance 
of a real machine, then realistic cost transformations are possible although the 
transformation takes place at an abstract level. In some algorithm derivations, 
it is helpful to begin with a horizontal transformation within the same APM 
that increases the costs. This can happen because the reorganized algorithm 
may satisfy the constraints required to allow a vertical transformation to a more 
efficient algorithm using a different APM. In such complex program derivations 
it is helpful to be explicit about the costs and programming models in use at 
each stage; that is the purpose of the APM methodology. 

The rest of the paper is organized as follows: Section 2 gives an overview of
the APM approach. Section 3 introduces cost hierarchies to APM hierarchies.
Sections 4 and 5 illustrate the approach by examples. Section 6 concludes.

2 Overview of APMs 

Abstract Parallel Machines (APMs) have been proposed as a formal
framework for the derivation of parallel algorithms using a sequence of
transformation steps. The formulation of a parallel algorithm depends not only on the
algorithm-specific potential parallelism but also on the parallel programming
model and the target machine to be used. Every programming model provides 
a specific way to exploit or express parallelism, such as data parallel models or 
thread parallelism, in which the algorithm has to be described. An APM de- 
scribes the behavior of a parallel programming model by providing operations 
(or patterns of operations) to be used in a program performed in that program- 
ming model. The basic operations provided by an APM are parallel operations 
(ParOps) which are combined by an APM-specific coordination language (usu- 
ally, e.g., including a composition function). The application is formulated for 
an APM with ParOps as the smallest indivisible parallel units to express a spe- 
cific application algorithm. Depending on the level of abstraction, an executable 
program (e.g., an MPI program for distributed memory machines) or a more 
abstract specification (e.g., a PRAM program for a theoretical analysis) results. 
The APM approach comprises: 

— the specification framework of ParOps defining the smallest units in a specific 
parallel programming model, see Section 2.1;

— APM definitions consisting of a set of ParOps and a coordination language 
using them, see Section 0 for an example; 

— a hierarchy of APMs built up from different APMs (expressing different 
parallel programming models) and relations of expressiveness between them; 

— the formulation of an algorithm within one specific APM, see also Section 2] 
for an example; and 

— the transformation of an algorithm into an equivalent algorithm (e.g., an 
algorithm with the same result semantics), but expressed in a different way 
within the same APM (horizontal transformation) or in a different APM 
(vertical transformation), see Section 2.3.

In the following subsections, we describe the APM framework in more detail. 



2.1 Framework to Define a Parallel Operation ParOp 

A parallel operation ParOp is executed on a number of sites $P_1, \ldots, P_n$ (these
may be virtual processors or real processors). The framework for describing a
ParOp uses a local function $f_i$ executed on site $P_i$ using the local state $s_i$ of $P_i$ for
$i = 1, \ldots, n$ and data provided by other sites $z_1, \ldots, z_n$ or input data $x_1, \ldots, x_r$.
Data from other sites used by $P_i$ are provided by a projection function $g_i$ which
selects data from the set $V$ of available values, consisting of the inputs $x_1, \ldots, x_r$
and the data $z_1, \ldots, z_n$ of all other sites, see Figure 1. The result of a ParOp is a
new state $s'_1, \ldots, s'_n$ and output data $y_1, \ldots, y_t$. A ParOp is formally defined by

\[
\begin{array}{l}
\text{ParOp } ARG\ (s_1, \ldots, s_n)\ (x_1, \ldots, x_r) = ((s'_1, \ldots, s'_n), (y_1, \ldots, y_t)) \\
\quad \text{where } (s'_i, z_i) = f_i(s_i, g_i(V)) \\
\quad \phantom{\text{where }} (y_1, \ldots, y_t) = g_0(V) \\
\quad \phantom{\text{where }} V = ((x_1, \ldots, x_r), z_1, \ldots, z_n)
\end{array}
\qquad (1)
\]

Fig. 1. Illustration of the possible data flow of a ParOp. $f_i$ denotes the local
computation of site $P_i$. $g_i$ chooses values from $x, z_1, \ldots, z_n$. Only one projection function $g_i$ is
depicted so as to keep the illustration readable. The arrows do not indicate a cycle of data
dependencies since the value provided by $f_i$ need not be given back by $g_i$.



where $ARG$ is a list of functions from $\{f_1, \ldots, f_n, g_0, g_1, \ldots, g_n\}$ and contains
exactly those functions that are not fully defined within the body of the ParOp.
The functions $f_1, \ldots, f_n, g_0, g_1, \ldots, g_n$ in the body of the ParOp definition can

— be defined as closed functions, so that the behavior of the ParOp is fully 
defined, 

— define a class of functions, so that details have to be provided when using 
the ParOp in a program, or 

— be left undefined, so that the entire function has to be provided when using 
the ParOp. 

The functions that have to be provided when using the ParOp appear in the 
function argument list ARG as formal parameters in the definition and as 
actual functions in a call of a ParOp. 

The framework describes what a ParOp does, but not necessarily how it is 
implemented. In particular, the gi functions imply data dependencies among the 
sites; these dependencies constrain the set of possible execution orders, but they 
may not fully define the order in an implementation. Consequently, the cost 
model for a ParOp may make additional assumptions about the execution order 
(for example, the degree of parallelism). 
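
The framework of definition (1) can be sketched as a higher-order function. The Python code below is a minimal illustration under stated assumptions, not part of the APM formalism: the names `par_op` and `Values` are hypothetical, sites are numbered from 0, and the values $z_j$ are computed lazily so that each $g_i$ may read the data of other sites. A genuinely cyclic dependency, which the framework excludes, is reported as an error.

```python
from typing import Any, Callable, Dict, Sequence, Set, Tuple


class Values:
    """The set V of available values: inputs x and site data z_j (lazy)."""

    def __init__(self, x: Sequence[Any], compute_site: Callable[[int], Tuple[Any, Any]]):
        self.x = tuple(x)
        self._compute_site = compute_site

    def z(self, j: int) -> Any:
        # Data item z_j provided by site j, computed on first use.
        return self._compute_site(j)[1]


def par_op(fs, gs, g0, states, xs):
    """Evaluate (s'_i, z_i) = f_i(s_i, g_i(V)) for all sites, then g_0(V)."""
    cache: Dict[int, Tuple[Any, Any]] = {}
    in_progress: Set[int] = set()

    def compute_site(i: int) -> Tuple[Any, Any]:
        if i in cache:
            return cache[i]
        if i in in_progress:
            raise ValueError(f"cyclic data dependency involving site {i}")
        in_progress.add(i)
        result = fs[i](states[i], gs[i](V))  # (new state s'_i, data z_i)
        in_progress.discard(i)
        cache[i] = result
        return result

    V = Values(xs, compute_site)
    new_states = tuple(compute_site(i)[0] for i in range(len(states)))
    outputs = g0(V)
    return new_states, outputs


# Example: site 0 reads the input, site i > 0 reads z_{i-1}.
# Each site adds the received value to its state and exposes its old state.
fs = [lambda s, v: (s + v, s)] * 3
gs = [lambda V: V.x[0], lambda V: V.z(0), lambda V: V.z(1)]
g0 = lambda V: (V.z(2),)
result = par_op(fs, gs, g0, (1, 2, 3), (10,))  # -> ((11, 3, 5), (3,))
```

The sketch fixes one evaluation order (demand-driven); as noted above, the framework itself only constrains the possible orders through the data dependencies implied by the $g_i$.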

2.2 Using APMs in a Program 

To express an application algorithm, the parallel operations defined for a specific 
APM are used and combined according to the coordination language. When 
using a ParOp in a program, one does not insert an entire ParOp definition. 
Instead, the operation is called along with any specific function arguments that 
are required. Whether function arguments are required depends on the definition 
of the specific ParOp. 

If the function arguments $f_i$, $g_i$ are fully defined as functions in closed form,
then no additional information is needed and the ParOp is called by just using
its name. If one or more of the functions $f_i$, $g_i$, $i = 1, \ldots, n$, are left undefined, then
the call of this ParOp has to include the specific functions, possibly restricted
according to the class of functions allowed. A call of a ParOp has the form

ParOp ARG

where $ARG$ contains exactly those functions of $\{f_1, \ldots, f_n, g_0, \ldots, g_n\}$ that are
needed. This might be given in the form

\[
\begin{array}{l}
f_{k_1} = \text{definition}, \ldots, f_{k_l} = \text{definition}, \\
g_{j_1} = \text{definition}, \ldots, g_{j_m} = \text{definition}, \\
\text{ParOp}(f_{k_1}, \ldots, f_{k_l}, g_{j_1}, \ldots, g_{j_m})
\end{array}
\]



2.3 Vertical and Horizontal Transformations between APMs 

One goal of the APM approach is to model different parallel programming models
within the same framework so that the relationship between two different models 
can be expressed. The relationship between two parallel programming models is 
based on the expressiveness of the APMs which is captured in the ParOps and 
the coordination language combining the ParOps. 

We define a relation between two different APMs in terms of a
transformation mapping any program for an APM $M_1$ onto a program for an APM $M_2$.
The transformation is built up according to the structure of an APM program;
thus it is defined on the ParOps of APM $M_1$ and then generalized to the
entire coordination language. The transformation of a ParOp is based on its result
semantics, i.e., the local data and output produced from input data at a given
local state.

An APM $M_1$ can be simulated by another APM $M_2$ if for every ParOp $F$
of $M_1$ there is a ParOp $G$ (or a sequence of ParOps $G_1, \ldots, G_l$) which has the
same result semantics as $F$, i.e., starting with the same local states $s_1, \ldots, s_n$
and input data $x$ it produces the same new local states $s'_1, \ldots, s'_n$ and output
data $y$.

If an APM $M_1$ can be simulated by an APM $M_2$, this does not necessarily
mean that $M_2$ can be simulated by $M_1$. If $M_1$ can be simulated by $M_2$, then
$M_1$ is usually more abstract than $M_2$. Therefore, we arrange $M_1$ and $M_2$ in a
hierarchical relationship with $M_1$ being the parent node of $M_2$. Considering an
entire set of APMs, we get a tree or a forest showing a hierarchy of APMs and
the relationship between them, see Figure 2.

The relationship between APMs serves as a basis for transforming an
algorithm expressed on one APM to the same algorithm now expressed in the
second related APM. For two related APMs $M_1$ and $M_2$ a transformation
operation $T_{M_1}^{M_2}$ from $M_1$ to $M_2$ is defined according to the simulation relation, i.e.,
for each ParOp $F$ of APM $M_1$

\[
T_{M_1}^{M_2}(F) = G \quad (\text{or } T_{M_1}^{M_2}(F) = G_1, \ldots, G_l)
\]

[Figure 2: panels "Hierarchy of abstract machines", "Derivation of an algorithm", and "Explanation of the notation"]

Fig. 2. Illustration of a hierarchy of abstract parallel machines and a derivation of an
algorithm according to the hierarchy.



where $G$ is the ParOp of $M_2$ to which $F$ is related. In this kind of transformation
step (which is called a vertical transformation), the program $A_1$ is left essentially
unchanged, but it is realized on a different APM: thus $(A_1, M_1)$ is transformed to
$(A_2, M_2)$. The operations $F$ in $A_1$ are replaced by the transformation $T_{M_1}^{M_2}(F)$,
so that $A_2$ uses the parallel operations in $M_2$ which realize the operations used
by $A_1$.

There can also be a second kind of transformation (called a horizontal
transformation) that takes place entirely within one APM $M_1$: $(A_1, M_1)$ is
transformed into $(A_2, M_1)$, where a correctness-preserving transformation must be
used to convert $A_1$ into $A_2$. In the context of our methodology, this means that
a proof is required that for all possible inputs $X^0, \ldots, X^n$ and states $a$, the two
versions of the algorithm produce the same result, i.e.

\[
A_1(X^0, \ldots, X^n, a) = A_2(X^0, \ldots, X^n, a).
\]

There are several approaches for developing parallel programs by performing
transformation steps, many of which have been pursued in a functional
programming environment. Transformations based on the notion of homomorphism
and the Bird-Meertens formalism have been used for this purpose. P3L uses a set of
algorithmic skeletons like pipelines and worker farms to capture common parallel
programming paradigms. A parallel functional skeleton technique that emphasizes the
data organization and redistributions has also been described. A sophisticated
approach for the cost modeling of the composition of skeletons for a homomorphic
skeleton system equipped with an equational transformation system has also been
outlined. The costs of the skeletons are required to be monotonic in the costs of
the argument functions. The method performs a stepwise application of
rewriting rules such that each application of a rewriting rule is cost-reducing. All these
approaches restrict the algorithm to a single programming model, and they use
the costs only to help select horizontal transformations. Vertical transformations
between different programming models which could be used for the
concretization of parallel programs are not supported.



John O'Donnell, Thomas Rauber, and Gudula Rünger




Fig. 3. Illustration of the connection between APMs, their corresponding costs and
algorithms expressed using those APMs. The hierarchy contains only $APM_1$ and $APM_2$ with
associated cost models $costs_1$ and $costs_2$. An algorithm $A$ can be expressed within $APM_1$
and is transformed horizontally into algorithm $B$ within the same APM which is then
transformed vertically into algorithm $C$ within $APM_2$.



3 Cost Hierarchies 

The APM method has separated the specifics of a parallel
programming model from the properties of an algorithm to be expressed in the
model. Since the APMs and the algorithm are expressed with a similar
formalism, and the relations between APMs are specified precisely, it is possible to
perform program transformations between different parallel programming
models. In this section, we enrich the APM approach with a third component to
capture costs, see Figure 3. The goal is to provide information that supports
cost-driven transformations.



3.1 Cost Models for Leaf Machines 

We consider an APM hierarchy whose leaves describe real machines with a pro- 
gramming interface for non-local operations. These would include, for example, 
a communication library for distributed memory machines (DMMs) or a coor- 
dination library for accessing the global memory of shared memory machines 
(SMMs) concurrently. Each operation of the real parallel machine is modeled by 
an operation of the corresponding leaf APM. By measuring the runtimes of the 
operations on the real parallel machine, the APM operations can be assigned 
costs that can be used to describe costs of a program. Since the execution times 
of many operations of the real machine depend on a number of parameters, the 
costs of the APM operations are described by parameterized runtime functions. 
For example, the costs of a broadcast operation on a DMM depend on the num- 
ber p of participating processors and the size n of the message to be broadcast. 
Correspondingly, the costs are described by a function 

\[
t_{broad}(p, n) = f(p, n)
\]

where $f$ depends on the specific parallel machine and the communication library.
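
As an illustration of what such a parameterized runtime function might look like, the sketch below assumes a tree-based broadcast with a per-message startup time `alpha` and a per-unit transfer time `beta`. This particular form of $f$, the parameter names, and the default values are assumptions chosen for the example; they are not taken from the paper, where $f$ would be obtained by measurement on the target machine.

```python
def t_broad(p: int, n: int, alpha: float = 1.0e-5, beta: float = 1.0e-8) -> float:
    """Hypothetical runtime function f(p, n) for a broadcast.

    Assumes a binomial-tree broadcast: ceil(log2(p)) communication steps,
    each costing alpha + beta * n for a message of size n.
    """
    if p < 1 or n < 0:
        raise ValueError("need p >= 1 processors and message size n >= 0")
    steps = (p - 1).bit_length()  # equals ceil(log2(p)) for p >= 1
    return steps * (alpha + beta * n)
```

Measured runtimes for several values of `p` and `n` could then be used to fit `alpha` and `beta` for a given machine and communication library.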

Usually, it is difficult to describe exactly the execution time of operations
on real parallel machines. The local execution times are difficult to describe,
since the processors may have a complicated internal architecture including a
memory hierarchy, several functional units, and pipelines with different stages.
Moreover, techniques like branch prediction and out-of-order execution may be
used. The global execution times may be difficult to describe, since, for example,
a distributed shared memory machine uses a physically distributed memory and
emulates a shared memory by several caches, using, e.g., a directory-based cache
coherence protocol. But for many (regular) applications and many DMMs, it is
possible to describe the execution times of the machines accurately enough to
compare different execution schemes for the same algorithm. The main
concern of this article is not so much to give an accurate model for a specific
(class of) parallel machines, but rather to extend an existing model so that it 
can be used at a higher programming level to compare different implementations 
of the same algorithm or to guide program transformations that lead to a more 
efficient program. 

3.2 Bottom-Up Construction of a Cost Hierarchy 

Based on the cost models for the leaves of an APM hierarchy, cost models for
the inner nodes of an APM hierarchy can be derived step by step. We consider
an APM $M_1$ which is the parent of an APM $M_2$ for which a cost measure $C_2$
has already been defined. At the beginning of the derivation $M_2$ has to be a
leaf. Since $M_1$ is the parent of $M_2$, there is a transformation $T_{M_1}^{M_2}$ which assigns
each parallel operation $F$ of $M_1$ an equivalent sequence $G_1, \ldots, G_l$ of parallel
operations, each of which has been assigned a cost $C_2(G_i)$. We define a cost measure
$C_{M_1 \leftarrow M_2}$ based on the cost measure $C_2$ for $M_2$ by

\[
C_{M_1 \leftarrow M_2}(F) = \sum_{i=1}^{l} C_2(G_i). \qquad (2)
\]

A cost measure C2 for M2 may again be based on other cost measures, if M2 
is not a leaf. If the programmer intends to derive a parallel program for a real 
parallel machine R which is a leaf in the APM hierarchy, each intermediate level 
APM is assigned a cost measure that is based on the cost of R, i.e., the selection 
of the cost measure is determined by the target machine. Thus, for each path 
from a leaf to an inner node B there is a possibly different cost measure. 
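
The bottom-up construction of definition (2) can be sketched in a few lines. In the Python code below, a transformation is modeled as a dictionary from an operation name of the parent APM to the sequence of child operations that realizes it; the operation names, the cost values, and the helper `lift_cost` are all hypothetical and serve only to illustrate the summation.

```python
from typing import Callable, Dict, List

Cost = Callable[[str], float]


def lift_cost(translation: Dict[str, List[str]], c_child: Cost) -> Cost:
    """Build the parent cost measure from the child measure, as in (2)."""

    def c_parent(op: str) -> float:
        # Cost of F is the sum of the costs of G_1, ..., G_l realizing F.
        return sum(c_child(g) for g in translation[op])

    return c_parent


# Leaf APM: costs as they might be measured on a real machine (made up here).
leaf_costs = {"send": 5.0, "recv": 5.0, "add": 1.0}

# Middle APM realized on the leaf, top APM realized on the middle one.
mid_to_leaf = {"exchange": ["send", "recv"], "add": ["add"]}
top_to_mid = {"reduce": ["exchange", "add", "exchange", "add"]}

c_mid = lift_cost(mid_to_leaf, leaf_costs.__getitem__)
c_top = lift_cost(top_to_mid, c_mid)

reduce_cost = c_top("reduce")  # -> 22.0
```

Choosing a different leaf machine changes `leaf_costs` and therefore every lifted measure above it, which corresponds to the remark that each path from a leaf to an inner node may yield a different cost measure.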

We now can define the equivalence of cost measures for an inner node $M$
of an APM hierarchy with children $M_1$ and $M_2$. Cost measures $C_{M \leftarrow M_1}$ and
$C_{M \leftarrow M_2}$ for APM $M$ can be defined based on cost measures $C_1$ and $C_2$ of
$M_1$ and $M_2$, respectively. We call $C_{M \leftarrow M_1}$ and $C_{M \leftarrow M_2}$ equivalent if for arbitrary
programs $A_1$ and $A_2$, the following is true:

\[
\text{if } C_{M \leftarrow M_2}(A_1) < C_{M \leftarrow M_2}(A_2) \text{ then } C_{M \leftarrow M_1}(A_1) < C_{M \leftarrow M_1}(A_2)
\]

and vice versa. If two cost measures are equivalent, then both measures can be
used to derive efficient programs and it is guaranteed that both result in the
same program since both have the same notion of optimality. Note that the
equivalence of two cost measures for an APM does not require that they yield
the same cost value for each program.
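
The equivalence condition can be checked mechanically on a finite sample of programs. The sketch below is only such a sample-based check (it can refute equivalence, but it cannot prove it for all programs); the function name `equivalent_on` and the toy cost measures are assumptions made for this example.

```python
from typing import Callable, Iterable, List


def equivalent_on(
    cost_a: Callable[[List[str]], float],
    cost_b: Callable[[List[str]], float],
    programs: Iterable[List[str]],
) -> bool:
    """True if both measures order every sampled pair of programs alike."""
    sample = list(programs)
    for a1 in sample:
        for a2 in sample:
            # "if ... then ..., and vice versa" is a biconditional.
            if (cost_a(a1) < cost_a(a2)) != (cost_b(a1) < cost_b(a2)):
                return False
    return True


# Toy programs are lists of operation names.
programs = [["op"], ["op", "op"], ["op", "op", "op"]]
count_ops = lambda prog: float(len(prog))
scaled = lambda prog: 2.0 * len(prog) + 1.0  # different values, same order
inverted = lambda prog: -float(len(prog))    # reversed order

same_order = equivalent_on(count_ops, scaled, programs)
different_order = equivalent_on(count_ops, inverted, programs)
```

Here `count_ops` and `scaled` never assign the same value to a program, yet they agree on which of two programs is cheaper, which is exactly the situation described in the note above.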

3.3 Monotonicity of Cost Measures 

The cost measure for an APM can be used to guide horizontal transformations. 
For this purpose, a cost measure must fulfil the property that a horizontal trans- 
formation that is useful for an APM M is also useful for all concretizations of 
M . This property is described more precisely by the notion of monotonicity of 
cost measures. We consider APMs $M_2$ and $M_1$ where $M_2$ is a child of $M_1$ in
the APM hierarchy. Let $A_1$ and $A_2$ be two programs for APM $M_1$ where $A_2$ is
obtained from $A_1$ by a horizontal transformation $T_{M_1}$ which reduces the costs
according to a cost measure $C_1$, i.e.,

\[ C_1(A_1) > C_1(A_2) = C_1(T_{M_1}(A_1)). \]

Let $T_{M_1}^{M_2}$ be the vertical transformation from $M_1$ to $M_2$, i.e., the corresponding
programs to $A_1$ and $A_2$ on APM $M_2$ are $A'_1 = T_{M_1}^{M_2}(A_1)$ and $A'_2 = T_{M_1}^{M_2}(A_2)$,
respectively. Both these programs can be assigned costs according to a cost
measure $C_2$ of APM $M_2$. The cost measures $C_1$ and $C_2$ are consistent only if the
increase in efficiency that has been obtained by the transformation from $A_1$ to
$A_2$ carries over to APM $M_2$, i.e., only if

\[ C_2(T_{M_1}^{M_2}(A_1)) > C_2(T_{M_1}^{M_2}(A_2)). \]

This property is captured by the following definition of monotonicity. The
transformation $T_{M_1}^{M_2}$ is monotonic with respect to the costs $C_1$ and $C_2$, if for arbitrary
programs $A_1$ and $A_2$

\[ C_1(A_1) > C_1(A_2) \text{ implies } C_2(T_{M_1}^{M_2}(A_1)) > C_2(T_{M_1}^{M_2}(A_2)). \]

The bottom-up construction of cost measures according to an APM hierarchy 
creates monotonic cost measures; this can be proven by a bottom-up induction 
over the APM tree using definition (2) of cost measures. In the next section, we
describe the PRAM model with different operation sets in the APM methodol- 
ogy. For this example, we do not use the bottom-up cost definition but use the 
standard costs of the PRAM model. 
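
The induction step of this argument can be sketched as follows. The sketch rests on an assumption that the text does not state explicitly: a program $A = F_1; \ldots; F_k$ for $M_1$ costs the sum of the costs of its parallel operations, and the vertical transformation replaces each $F_m$ by its sequence $G_{m,1}, \ldots, G_{m,l_m}$ (the double index is introduced here only for illustration).

```latex
% Assumption: program cost is additive over the parallel operations F_1, ..., F_k.
\begin{align*}
C_1(A) &= \sum_{m=1}^{k} C_{M_1 \leftarrow M_2}(F_m)
        && \text{additivity of program cost} \\
       &= \sum_{m=1}^{k} \sum_{i=1}^{l_m} C_2(G_{m,i})
        && \text{definition (2)} \\
       &= C_2\bigl(T_{M_1}^{M_2}(A)\bigr)
        && \text{additivity on } M_2 .
\end{align*}
% Hence C_1(A_1) > C_1(A_2) implies
% C_2(T_{M_1}^{M_2}(A_1)) > C_2(T_{M_1}^{M_2}(A_2)).
```

Under this assumption the derived measure assigns a program exactly the cost of its translation, so any cost reduction obtained on $M_1$ carries over unchanged to $M_2$.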

3.4 Other Cost Models 

One of the most popular parallel cost models is the PRAM model and its
extensions. Because none of the PRAM models was completely satisfactory, a
number of other models have been proposed that are not based on the existence
of a global memory, including BSP and LogP. Both provide a cost calculus
by modeling the target architecture with several parameters that capture its
computation and communication performance. The supersteps in the BSP model
allow for a straightforward estimation of the runtime of complete programs.
Another cost modeling method is available for the skeleton approach used in the
context of functional programming. In the skeleton approach, programs
are composed of predefined building blocks (data skeletons) with a predefined
computation and communication behavior which can be combined by algorithmic
skeletons capturing common design principles like divide and conquer. This
structured form of defining programs enables a cost modeling according to the
compositional structure of the programs. A cost modeling with monads has been
used in the GOLDFISH system. In contrast to those approaches, APM costs
allow the transformation of costs from different parallel programming models,
so they can be used to guide transformations between multiple programming
models.

4 APM Description of PRAMs 

The PRAM (parallel random access machine) is a popular parallel programming 
model in theoretical computer science, widely used to design and analyze par- 
allel algorithms for an idealized shared memory machine. A PRAM consists of 
a bounded set of processors and a common memory containing a potentially 
unlimited number of words. Each processor is similar to a RAM that can
access its local random access memory and the common memory. A PRAM algo- 
rithm consists of a number of PRAM steps which are performed in a SIMD-like 
fashion; i.e., all processors needed in the algorithm take part in a number of 
consecutive synchronized computation steps in which the same local function is 
performed. One PRAM step consists of three parts: 

1. the processors read from the common memory; 

2. the processors perform local computation with data from their local memo- 
ries; and 

3. the processors write results to the common memory. 

Local computations differ because of the local data used and the unique
identification number $id_i$ of each processor $P_i$, for $i = 1, \ldots, n$ (where $n$ is the number
of processors).

4.1 An APM for PRAMs 

There are many ways to describe the PRAM within the APM framework. In 
the PRAM model itself, the processors perform operations that cause values 
to be obtained from and sent to the common memory, but the behavior of the 
common memory itself is not modeled in detail. It is natural, therefore, to treat 
each PRAM processor as an APM site, and to treat the common memory as an 
implicit agent which is not described explicitly. The APM framework allows this 
abstract picture: transactions between the processors and the common memory 
are specified as Input/Output transactions between the APM operations and 
the surrounding environment. 

The APM description of the PRAM model provides three ParOps for the
PRAM substeps and a coordination language that groups the ParOps into steps 
and composes different steps while guaranteeing the synchronization between 
them. No other ParOps are allowed in a PRAM algorithm. The specific com- 
putations within the parallel operations needed to realize a specific algorithm 
are chosen by an application programmer or an algorithm designer. These local 
functions and the PRAM ParOps constitute a complete PRAM program. 

The three ParOps of a PRAM step are READ, EXECUTE and WRITE. The
state $(s_1, \ldots, s_n)$ denotes the data $s_i$ in the local memory of processor $P_i$
($i = 1, \ldots, n$) involved in the PRAM program. The data from the common memory
are the input $(x_1, \ldots, x_r)$ to the local computation and the data written back
to the common memory are the output $(y_1, \ldots, y_t)$ of the computation.

1. In the read step, data from the common memory are provided and for each
processor $P_i$ the function $g_i$ picks appropriate values from $(x_1, \ldots, x_r)$ which
are then stored by the local behavior function $f_i$ = store in the local memory,
producing a new local state $s'_i$. There are no internal values produced in this
step, so a dummy placeholder _ is used for the $z_i$ term. The exact behavior of
a specific READ operation is determined by the specific $g_i$ functions, which
the programmer supplies as an argument to READ. Thus the ParOp defined
here is READ $(g_1, \ldots, g_n)$.

   READ (g_1, ..., g_n) (s_1, ..., s_n) (x_1, ..., x_r) = ((s'_1, ..., s'_n), ( ))
      where (s'_i, _) = store(s_i, g_i(V))
            V = ((x_1, ..., x_r), _, ..., _)
            ( ) = g(V)

2. In a local computation, each processor applies a local function $f$ to the
local data $s_i$ in order to produce a new state $s'_i$. The substep for the local
computation does not involve the common memory, so the input and output
vectors $x$ and $y$ are empty. In this operation, the programmer must supply
the argument function $f$, which determines the behaviors of the sites.

   EXECUTE (f, ..., f) (s_1, ..., s_n) ( ) = ((s'_1, ..., s'_n), ( ))
      where (s'_i, _) = f(s_i, _)
            V = (( ), _, ..., _)
            ( ) = g(V)

3. At the end of a PRAM step, data from the local states $(s_1, \ldots, s_n)$ are written
back to the common memory. Each local function $f_i$ selects data $z_i$ from
its local state $s_i$. From those data $(z_1, \ldots, z_n)$ the function $g$ forms the vector
$(y_1, \ldots, y_t)$, which is the data available in the common memory in the next
step. If two different processors select the same variable $d$, then the function
$g$ is capable of modelling the different write strategies corresponding
to different PRAM models. The programmer specifies the local fetch
functions $f_i$ to determine the values that are extracted from the local states
in order to be sent to the common memory.

   WRITE (f_1, ..., f_n) (s_1, ..., s_n) ( ) = ((s'_1, ..., s'_n), (y_1, ..., y_t))
      where (s'_i, z_i) = f_i(s_i, _)
            V = (( ), z_1, ..., z_n)
            (y_1, ..., y_t) = g(V)

The function g(( ), A_1, ..., A_n) = (A_1, ..., A_n) produces an output vector
with one value from each processor in the order of the processor numbers.

The PRAM steps can be combined by a sequential coordination language 
with for-loops and conditional statements. The next subsection gives an example. 
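
The three ParOps can be mimicked in a few lines of Python. This is only a sketch under simplifying assumptions that are not part of the APM definitions: a local state is a dictionary of named variables, the common memory is a list, each $g_i$ returns the bindings to store, and the helper names `pram_read`, `pram_execute` and `pram_write` are hypothetical.

```python
def pram_read(gs, states, memory):
    """READ: site i stores the values that g_i selects from the common memory."""
    # store(s_i, g_i(V)) is modelled by merging the selected bindings into s_i.
    return [{**s, **g(memory)} for g, s in zip(gs, states)]

def pram_execute(f, states):
    """EXECUTE: every site applies the same local function f to its own state."""
    return [f(s) for s in states]

def pram_write(fs, g, states):
    """WRITE: site i emits z_i = f_i(s_i); g forms the new common memory."""
    zs = [f(s) for f, s in zip(fs, states)]
    return g(zs)

n = 4
X = [3, 1, 4, 1]                       # common memory before the step
states = [{"id": i} for i in range(n)]

# READ b_i := X_i
states = pram_read([lambda mem, i=i: {"b": mem[i]} for i in range(n)], states, X)
# EXECUTE b_i := 2 * b_i
states = pram_execute(lambda s: {**s, "b": 2 * s["b"]}, states)
# WRITE Y_i := b_i, with g collecting one value per site in processor order
Y = pram_write([lambda s: s["b"]] * n, list, states)
print(Y)
```

Replacing the `list` argument of `pram_write` by a function that resolves collisions is one way to model the different write strategies mentioned above.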



4.2 Example Program: PRAM Multiprefix 

More complex operations for PRAMs can be built up from the basic PRAM
step. This section illustrates the process by defining the implementation of a
multiprefix operation on a PRAM APM that lacks a built-in multiprefix
operation. Initially, the common memory contains an input array $X$ with elements
$X_i$, for $0 \le i < n$; after the $n$ sites execute a multiprefix operation with addition
as the combining operation, the common memory holds a result array $B$, where
$B_i = \sum_{j=0}^{i} X_j$ for $0 \le i < n$.

Several notations are used to simplify the presentation of the algorithm and
to make it more readable; see Figure 4 (right). Upper case letters denote
variables in the common memory, while lower case letters are used for the local site
memories. The $f$ and $g$ functions are specified implicitly by describing abstract
PRAM operations; the low-level definitions of these functions are omitted. Thus
the notation READ $b_i := X_i$ means that a PRAM READ operation is performed
with $f$ and $g$ functions defined so as to perform the parallel assignment. Similar
conventions are used for the other operations, EXECUTE and WRITE.
Furthermore, a constraint on the value of $i$ in an operation means that the operation
is performed only in those sites where the constraint is satisfied; other sites do
nothing. (Again, the constraints are implemented through the definitions of the
$f$ and $g$ functions.)

Figure 4 (left) gives the program realizing the multiprefix operation. The
algorithm first copies the array $X$ into $Y$. This is done in $O(1)$ time by
performing a parallel READ that stores $X_i$ into site $i$, enabling the values to be stored in
parallel into $Y$ with a subsequent WRITE. Then a loop with $\log n$ steps is
executed. In step $j$, each processor $P_i$ (for $2^j \le i < n$) reads the value accumulated
by processor $P_{i-2^j}$, and it adds this to its local value of $b_i$. The other processors,
$P_i$ for $0 \le i < 2^j$, leave their local $b_i$ value unchanged. The resulting array is
then stored back into $Y$ in the common memory. All operations are executed in
the READ/EXECUTE/WRITE scheme required by the PRAM model. The
initialization, as well as each of the $\log n$ loop iterations, requires $O(1)$ time, so the
full algorithm has time cost $O(\log n)$.

Program (left):

   Initially: inputs are in X_0, ..., X_{n-1}.
   Result: Y_i = \sum_{j=0}^{i} X_j for 0 ≤ i < n

   procedure parScanl (⊕, n, X, Y)
      READ b_i := X_i for 0 ≤ i < n
      EXECUTE _
      WRITE Y_i := b_i for 0 ≤ i < n
      for j := 0 to log n - 1 do
         READ x_i := Y_{i-2^j} for 2^j ≤ i < n
         EXECUTE b_i := x_i + b_i for 2^j ≤ i < n
         WRITE Y_i := b_i for 0 ≤ i < n

Notation (right):

   READ b_i := X_i is an abbreviation for READ with
      g_i((x_0, ..., x_{n-1}), _, ..., _) = x_i
      s'_i = s_i[x_i/b_i], i = 0, ..., n-1

   EXECUTE b_i := x_i + b_i abbreviates EXECUTE with
      f(s_i, _) = (s'_i, _)
      s'_i = s_i[x_i + b_i/b_i]

   WRITE Y_i := b_i is an abbreviation for WRITE with
      f(s_i, _) = (s'_i, b_i)
      g(( ), b_0, ..., b_{n-1}) = (Y_0, ..., Y_{n-1})
      with Y_i := b_i, i = 1, ..., n-1

Fig. 4. Realization of a multiprefix operation expressed as an algorithm in a PRAM-APM.
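
The program of Fig. 4 can be traced with a short Python simulation. This is a sketch, not the authors' implementation: the lists `b` and `Y` stand for the local variables $b_i$ and the common array $Y$, one loop iteration models one READ/EXECUTE/WRITE step, and the loop bound is taken as $\lceil \log_2 n \rceil$ so that sizes that are not powers of two are also handled (an assumption beyond the figure).

```python
from operator import add

def par_scanl(op, X):
    """Simulate parScanl; returns the result array Y and the number of PRAM steps."""
    n = len(X)
    # Initial step: READ b_i := X_i, EXECUTE _, WRITE Y_i := b_i.
    b = list(X)
    Y = list(b)
    steps = 1
    # (n - 1).bit_length() equals ceil(log2 n) for n >= 1.
    for j in range((n - 1).bit_length()):
        d = 2 ** j
        # READ x_i := Y_{i-2^j}; all sites read before any site writes.
        x = [Y[i - d] if i >= d else None for i in range(n)]
        # EXECUTE b_i := x_i + b_i on the sites with 2^j <= i < n.
        b = [op(x[i], b[i]) if i >= d else b[i] for i in range(n)]
        # WRITE Y_i := b_i.
        Y = list(b)
        steps += 1
    return Y, steps

print(par_scanl(add, [1, 2, 3, 4, 5, 6, 7, 8]))
```

For eight inputs the simulation performs one copy step and three loop iterations, in line with the $O(\log n)$ step count stated in the text.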



4.3 PRAM* with Built-In Multiprefix 

The PRAM model can be enriched with a multiprefix operation (also called 
parallel scan), which is considered to be one PRAM step. The steps of a program 
written in the enriched model may be either multiprefix operations or ordinary 
PRAM steps. The multiprefix operation with addition as a combining operation 
(MPADDL) has the same behavior as the parScanl procedure in Figure 4 (left);
the only difference is that it is defined as a new PRAM ParOp, so its definition 
does not use any of the other ParOps READ, EXECUTE or WRITE. 

   MPADDL (s_0, ..., s_{n-1}) (X_0, ..., X_{n-1}) = ((s'_0, ..., s'_{n-1}), (Y_0, ..., Y_{n-1}))
      where (s'_i, b_i) = f_i (s_i, g_i V)
            V = ((X_0, ..., X_{n-1}), b_0, ..., b_{n-1})

where the functions f_0, ..., f_{n-1}, g_0, ..., g_{n-1} and g are defined as follows:

   g_i((X_0, ..., X_{n-1}), b_0, ..., b_{n-1}) = (X_0, ..., X_{i-1})

   f_i(s_i, (X_0, ..., X_{i-1})) = (s'_i, b_i) with b_i = \sum_{j=0}^{i} X_j, i = 1, ..., n-1

   g((X_0, ..., X_{n-1}), b_0, ..., b_{n-1}) = (Y_0, ..., Y_{n-1}) with Y_i := b_i, i = 1, ..., n-1

A family of related operations can be treated as PRAM primitives in the
same way. In addition to multiprefix addition from the left (MPADDL), we can
also form the sums starting from the right (MPADDR). There are corresponding
operations for other associative functions, such as maximum, for which we
can define MPMAXL and MPMAXR. Reduction (or fold) ParOps for associative
functions are also useful, such as FOLDMAX, which finds the largest of a set of
values chosen from the sites. Several of these operations will be used later in an
example (Section 5).
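
The intended input/output behavior of this family can be pinned down with a small Python sketch. It models only the values produced, not the PRAM step structure, and it assumes inclusive prefixes (site $i$ contributes its own value to its result), in line with the result stated in Fig. 4. The helper names `mp_left`, `mp_right` and `fold` are hypothetical.

```python
from functools import reduce
from itertools import accumulate
from operator import add

def mp_left(op, xs):
    """Multiprefix from the left: result[i] combines xs[0], ..., xs[i]."""
    return list(accumulate(xs, op))

def mp_right(op, xs):
    """Multiprefix from the right: result[i] combines xs[i], ..., xs[n-1]."""
    return list(accumulate(reversed(xs), lambda acc, x: op(x, acc)))[::-1]

def fold(op, xs):
    """Reduction of all values with an associative operation."""
    return reduce(op, xs)

def MPADDL(xs): return mp_left(add, xs)
def MPADDR(xs): return mp_right(add, xs)
def MPMAXL(xs): return mp_left(max, xs)
def MPMAXR(xs): return mp_right(max, xs)
def FOLDMAX(xs): return fold(max, xs)

xs = [3, 1, 4, 1, 5]
print(MPADDL(xs), MPADDR(xs), MPMAXL(xs), MPMAXR(xs), FOLDMAX(xs))
```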

[Figure: diagram of the Maximum Segment Sum transformation with costs. Recoverable labels: MSS sequential, O(n); MSS sequential, O(n log n); PRAM-APM: MSS parallel with prefix, O(log n); MSS parallel with built-in prefix, O(1).]

Fig. 5. Illustration of the transformation steps of the maximum segment sum.

4.4 PRAM Costs 

The PRAM model defines the cost of an algorithm to be the number of PRAM
steps that have to be performed by that algorithm, i.e., a PRAM step has cost
1. Thus, the cost of the multiprefix operation for the PRAM without a built-in
multiprefix operation is the number of steps executed by the implementation
in Figure 4. In the PRAM with built-in multiprefix, the cost for a multiprefix
operation is 1. In Section 3, we have described a way to define costs starting
from the leaves of the APM hierarchy. This is also possible for the PRAM and
is useful if algorithms have to be transformed to a real machine. An example of
a real machine with built-in multiprefix is the SB-PRAM, a multi-threaded
architecture supporting multiprefix operations for integers in hardware. On this
machine, each PRAM step and each multiprefix operation takes two cycles,
independently of the data values contributed by the different processors. This
can then be used as the cost measure of a leaf machine.
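
As a worked illustration of these costs, the following Python sketch counts PRAM steps for one multiprefix operation with and without the built-in primitive, and converts them to cycles using the two-cycle figure quoted above. The function names are hypothetical, the step count for the emulated version (one copy step plus $\lceil \log_2 n \rceil$ loop iterations) is read off Fig. 4, and charging the emulated version two cycles per step on the same machine is an assumption made for comparison.

```python
CYCLES_PER_STEP = 2  # per PRAM step or multiprefix operation, as quoted for the SB-PRAM

def multiprefix_steps(n, builtin):
    """PRAM steps needed for one multiprefix operation over n values."""
    if builtin:
        return 1
    # One copy step plus ceil(log2 n) loop iterations of the program in Fig. 4.
    return 1 + (n - 1).bit_length()

def multiprefix_cycles(n, builtin):
    """Leaf-machine cost in cycles, assuming every step costs CYCLES_PER_STEP."""
    return CYCLES_PER_STEP * multiprefix_steps(n, builtin)

for n in (8, 1024):
    print(n, multiprefix_cycles(n, builtin=False), multiprefix_cycles(n, builtin=True))
```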

5 Example: MSS Algorithm Transformation 

This section illustrates how an algorithm can be transformed within the APM 
framework in order to improve its efficiency. Figure 5 shows the organization of
the transformation, which moves from the sequential RAM to the parallel PRAM 
models, and which includes both horizontal and vertical transformations. 

We use as an example a version of the well-known maximum segment sum
(MSS) problem similar to that presented by Akl. Given a sequence of numbers
$X = X_0, \ldots, X_{n-1}$, the problem is to find the largest possible sum mss of a
contiguous segment within $X$. Thus we need the maximal value of $\sum_{i=u}^{v} X_i$
such that $0 \le u \le v < n$. Using prefix sums and maxima, the problem can be
solved by the following steps:

1. For $i = 0, \ldots, n-1$ compute $S_i = \sum_{j=0}^{i} X_j$.

2. For $i = 0, \ldots, n-1$ compute $M_i = \max_{i \le j < n} S_j$; let $a_i$ be the value of $j$ at
which $M_i$ is found.

3. For $i = 0, \ldots, n-1$ compute $B_i = M_i - S_i + X_i$.

4. Compute mss $= \max_{0 \le i \le n-1} B_i$; if $u$ is the index at which the maximum
is found, the maximum sum subsequence extends from $u$ to $v = a_u$.
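
The four steps can be checked against the definition with a minimal Python sketch. It is sequential and ignores the APM structure; the function names `mss` and `mss_brute` are hypothetical, and the brute-force version is included only as a reference for comparison.

```python
from itertools import accumulate

def mss(X):
    """Maximum segment sum via the four steps above."""
    n = len(X)
    S = list(accumulate(X))                        # step 1: prefix sums S_i
    M = list(accumulate(reversed(S), max))[::-1]   # step 2: M_i = max of S_j for j >= i
    B = [M[i] - S[i] + X[i] for i in range(n)]     # step 3: best segment starting at i
    return max(B)                                  # step 4

def mss_brute(X):
    """Reference: try every segment X_u, ..., X_v with 0 <= u <= v < n."""
    n = len(X)
    return max(sum(X[u:v + 1]) for u in range(n) for v in range(u, n))

X = [3, -4, 2, 5, -1, 6, -9, 4]
print(mss(X), mss_brute(X))
```

Step 3 works because $M_i - S_i + X_i$ equals $\max_{j \ge i} (S_j - S_{i-1})$, the largest sum of a segment that starts at position $i$.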

The first version of the algorithm (Figure 6, left) is written in a conventional
imperative style, except that the RAM programming model is made explicit.
The RAM has only one ParOp, an EXECUTE operation, and the RAM-APM
has only one site. The time of the algorithm is obtained by observing that the
number of EXECUTE operations performed is $O(n)$, and each takes time $O(1)$
(this result can be obtained either by analyzing the RAM APM, or by assuming
constant time for this leaf machine operation).



Algorithm 1 (left):

   Initially: inputs are in X_0, ..., X_{n-1}.
   Result: mss = max \sum_{i=u}^{v} X_i for 0 ≤ u ≤ v < n

   EXECUTE a := 0
   for i := 0 to n - 1 do
      EXECUTE S_i := a + X_i
      EXECUTE a := S_i
   EXECUTE a := NEGINF
   for i := n - 1 downto 0 do
      EXECUTE M_i := max (S_i, a)
      EXECUTE a := M_i
   for i := 0 to n - 1 do
      EXECUTE B_i := M_i - S_i + X_i
   EXECUTE mss := NEGINF
   for i := 0 to n - 1 do
      EXECUTE mss := max (mss, B_i)

Algorithm 2 (right):

   procedure seqCopy (n, Y, Z)
      for i := 0 to n - 1 do
         EXECUTE Y_i := Z_i
   procedure seqScanl (f, n, A, B)
      seqCopy (n, B, A)
      for j := 0 to log n - 1 do
         for i := 2^j to n - 1 do
            EXECUTE B_i := f (B_{i-2^j}, B_i)
   procedure seqScanr (f, n, A, B)
      seqCopy (n, B, A)
      for j := 0 to log n - 1 do
         for i := n - 1 - 2^j downto 0 do
            EXECUTE B_i := f (B_i, B_{i+2^j})
   function seqFoldll (f, a, n, A)
      EXECUTE q := a
      for i := 1 to n - 1 do
         EXECUTE q := f (q, A_i)
      return q
   begin
      seqScanl (+, n, X, S)
      seqScanr (max, n, S, M)
      for i := 0 to n - 1 do
         EXECUTE B_i := M_i - S_i + X_i
      mss := seqFoldll (max, NEGINF, n, B)
   end

Fig. 6. Algorithm 1 on the left: the sequential MSS expressed within the RAM-APM
with time complexity O(n) (NEGINF is the negative number with the largest absolute value).
Algorithm 2 on the right: the sequential MSS using a sequential multiprefix operation
expressed within the same RAM-APM with O(n log n) time.



The aim is to speed up the algorithm using a parallel scan on the PRAM 
model. To prepare for this step, we first perform a horizontal transformation that 
restructures the computation, making it compatible with the prefix operations. 
This results in Algorithm 2 (Figure 6, right). It is interesting to observe that the
transformation has actually resulted in a slower algorithm; its benefit is to put
us into a position to perform further transformations that will more than make
up for this slow-down.

We now perform a vertical transformation from the RAM to the PRAM
models, producing Algorithm 3 (Figure 7, left). As usual with vertical
transformations, there is very little change to the structure of the algorithm; the main
effect of the transformation is to use the PRAM to perform the iterations without
data dependencies in $O(1)$ time. Algorithm 3 still uses the basic PRAM, so
the parallel prefix operations require $O(\log n)$ time. The final step, Algorithm 4
(Figure 7, right), is produced by performing a vertical transformation onto the
PRAM* model, which supports parallel prefix operations as built-in operations
with a cost of $O(1)$ time.



Algorithm 3 (left):

   procedure parScanl (f, n, A, B)
      Defined in Figure 4
   procedure parScanr (f, n, A, B)
      Similar to parScanl
   function parFold (f, n, A)
      Similar to parScanl
   begin
      parScanl (+, n, X, S)
      parScanr (max, n, S, M)
      READ m_i := M_i, s_i := S_i, x_i := X_i for 0 ≤ i < n
      EXECUTE b_i := m_i - s_i + x_i for 0 ≤ i < n
      WRITE B_i := b_i for 0 ≤ i < n
      mss := parFold (max, n, B)
   end

Algorithm 4 (right):

   MPADDL (n, X, S)
   MPMAXR (n, S, M)
   READ m_i := M_i, s_i := S_i, x_i := X_i for 0 ≤ i < n
   EXECUTE b_i := m_i - s_i + x_i for 0 ≤ i < n
   WRITE B_i := b_i for 0 ≤ i < n
   mss := FOLDMAX (n, B)

Fig. 7. Algorithm 3 on the left: MSS Parallel Scan. PRAM APM, O(log n) time. Algorithm
4 on the right: MSS Parallel Scan. PRAM* APM, O(1) time.



6 Conclusion 

The three components of the APM methodology — an APM with its ParOps, the 
ParOp cost models and the algorithm — reflect the abstract programming model, 
the architecture of a parallel system, and the parallel program. The architecture 
is represented here very abstractly in the form of costs. 

In general, all three components of the methodology are essential. The costs 
are an important guide to the design of an efficient parallel algorithm. Sepa- 
rating the costs from the APMs allows several different ones to be associated 
with the same operation, representing the same functionality implemented on
different machines. However, it is not enough just to keep the semantics of the
APM parallel operations and their costs: the APM definitions are still needed,
as they make explicit the organization of data and computation into sites.
Efficient algorithm design for parallel machines must consider not only when a
computation on data is performed, but also where. An example is programming
on a distributed memory machine where a poorly chosen distribution of data to 
sites may cause time-consuming communication. 

We therefore conclude that all three components of the methodology are 
essential, but they should be separated from each other and made distinct. For 
solving a particular problem, some parts of the structure may not be needed, 
and can be omitted. For example, a programmer may find the intuition about 
cost provided by the ParOp definitions is sufficient, in which case the separate 
cost model structure is unnecessary. 

Recursion Unrolling for Divide and Conquer Programs

Radu Rugina and Martin Rinard

Abstract. This paper presents recursion unrolling, a technique for
improving the performance of recursive computations. Conceptually,
recursion unrolling inlines recursive calls to reduce control flow overhead
and increase the size of the basic blocks in the computation, which in
turn increases the effectiveness of standard compiler optimizations such
as register allocation and instruction scheduling. We have identified two
transformations that significantly improve the effectiveness of the basic
recursion unrolling technique. Conditional fusion merges conditionals
with identical expressions, considerably simplifying the control flow in
unrolled procedures. Recursion re-rolling rolls back the recursive part of
the procedure to ensure that a large unrolled base case is always executed,
regardless of the input problem size.

We have implemented our techniques and applied them to an important
class of recursive programs, divide and conquer programs. Our experimental
results show that recursion unrolling can improve the performance
of our programs by a factor of between 3.6 and 10.8, depending on
the combination of the program and the architecture.



1 Introduction 

Iteration and recursion are two fundamental control flow constructs. Iteration 
repeatedly executes the loop body, while recursion repeatedly executes the body 
of the procedure. Loop unrolling is a classical compiler optimization. It reduces 
the control flow overhead by producing code that tests the loop termination 
condition less frequently. By textually concatenating copies of the loop body, it 
also typically increases the sizes of the basic blocks, improving the effectiveness 
of other optimizations such as register allocation and instruction scheduling. 

This paper presents recursion unrolling, an analogous optimization for
recursive procedures. Recursion unrolling uses a form of procedure inlining to
transform the recursive procedures. Like loop unrolling, recursion unrolling
reduces control flow overhead and increases the size of the basic blocks in the
computation. But recursion unrolling is somewhat more complicated than loop
unrolling. The basic form of recursion unrolling reduces procedure call overheads
such as saving registers and stack manipulation. We have developed two
transformations that optimize the code further. Conditional fusion merges conditionals
with identical expressions, considerably simplifying the control flow in unrolled
procedures and increasing the sizes of the basic blocks. Recursion re-rolling rolls
back the recursive part of the procedure to ensure that a large unrolled base case
is always executed, regardless of the problem size.

S.P. Midkiff et al. (Eds.): LCPC 2000, LNCS 2017, pp. 34-46, 2001.
(c) Springer-Verlag Berlin Heidelberg 2001



1.1 Divide and Conquer Programs 

We have applied recursion unrolling to divide and conquer programs.

Divide and conquer algorithms solve problems by breaking them into smaller 
subproblems and recursively solving the subproblems. They use a base case com- 
putation to solve problems of small sizes and terminate the recursion. 

Divide and conquer algorithms have several appealing properties that make 
them a good match for modern parallel machines. First, they tend to have a lot of 
inherent parallelism. The recursive structure of the algorithm naturally leads to 
recursively generated concurrency, which typically generates more than enough 
concurrency to keep the machine busy. Second, divide and conquer programs also 
tend to have good cache performance and to automatically adapt to different 
cache sizes and hierarchies. As soon as a subproblem fits into one level of the 
memory hierarchy, the program runs out of that level or below until the problem 
has been solved. 

To fully exploit these properties, divide and conquer programs have to be 
efficient and execute useful computation most of the time, rather than spend- 
ing substantial time on dividing problems into subproblems, or on combining 
subproblems. The size of the base case controls the balance between the compu- 
tation time and the divide and combine time. If the base case is too small, the 
program spends most of its time in the divide and combine phases instead of per- 
forming useful computation. Unfortunately, the simplest and least error-prone 
coding styles reduce the problem to its minimum size (typically a problem size 
of one) before applying a very simple base case. Programmers therefore typically 
start with a simple program with a small base case, then unroll the recursion 
by hand to obtain a larger base case with better performance. This manual re- 
cursion unrolling is a tedious, error-prone process that obscures the structure of 
the code and makes the program much more difficult to maintain and modify. 

The recursion unrolling algorithm presented in this paper automates the 
process of generating efficient base cases. It gives the programmer the best of 
both worlds: clean, simple divide and conquer programs with efficient execution. 



1.2 Conditional Fusion 

Since divide and conquer programs split the given problem into several
subproblems, their recursive part typically contains multiple recursive calls. The size of
the unrolled code therefore increases exponentially with the number of times the
recursion is unrolled. Moreover, the control flow of the unrolled recursive
procedure also increases exponentially in complexity. The typical structure of a divide
and conquer program is a conditional with the base case on one branch and the
recursive calls on the other branch. Recursion unrolling generates an exponential
number of nested if statements. To substantially simplify the control flow
in the unrolled code, we apply a transformation called conditional fusion which
merges conditional statements with equivalent test conditions. This transformation
simplifies the generated code and improves the performance by reducing the
number of conditional instructions and coalescing groups of small basic blocks
into larger basic blocks.

1.3 Recursion Re-rolling 

Recursion unrolling increases the code size both for the base case and for the re- 
cursive part. Compared to the recursive part of the original recursive program, 
the recursive part of the unrolled procedure divides the given problem into a 
larger number of smaller subproblems. This has the advantage that several recur- 
sive levels are removed from the recursion call tree. But this accelerated division 
into subproblems may generate base case subproblems of small size, even when 
the recursion unrolling produces unrolled base cases for larger problem sizes. 
To ensure that the computation always executes the more efficient larger base 
case, we apply another transformation, recursion re-rolling, which replaces the 
recursive part of the unrolled procedure with the recursive part of the original 
program. 

1.4 Contributions 

This paper makes the following contributions: 

— Recursion Unrolling: It presents a new technique, recursion unrolling, for 
inlining recursive procedures. This technique iteratively constructs a set of 
unrolled recursive procedures. At each iteration, it conceptually inlines calls 
to recursive procedures. 

— Target programs: It shows how to use recursion unrolling for an impor- 
tant class of recursive programs, divide and conquer programs. Recursion 
unrolling is used for these programs to automatically generate more efficient 
unrolled base cases of larger size. 

— Code Transformations: It presents two new code transformations, con- 
ditional fusion and recursion re-rolling, that substantially improve the per- 
formance of recursive code resulting from recursion unrolling of divide and 
conquer programs. Both of these transformations reduce control flow over- 
head and increase the sizes of the basic blocks. 

— Experimental Results: It presents experimental results that characterize 
the effectiveness of the algorithms on a set of benchmark programs. Our re- 
sults show that the proposed code transformations can substantially improve 
the performance of our set of divide and conquer benchmark programs. 

The remainder of the paper is organized as follows. Section 2 presents a running
example that we use throughout the paper. Subsequent sections present the analysis
algorithms, experimental results from our implementation, and related work,
followed by our conclusions.

void dcInc(int *p, int n) {
  if (n == 1) {
    *p += 1;
  } else {
    dcInc(p, n/2);
    dcInc(p+n/2, n/2);
  }
}

Fig. 1. Divide and conquer array increment example

2 Example 

Figure 1 presents a simple example that illustrates the kinds of computations
that our recursion unrolling is designed to optimize. The dcInc procedure
implements a recursive, divide-and-conquer algorithm that increments each element
of an array. In the divide part of the algorithm, the dcInc procedure divides
each array into two subarrays. It then calls itself recursively to increment the
elements in each subarray. After the execution of several recursive levels the
program generates a subarray with only one element, at which point the algorithm
uses the simple base case statement *p += 1 to directly increment the single
element of the subarray.

Reducing the problem to a base case of one is the simplest way to write 
the program. Larger base cases require a more complex algorithm, which in 
general can be quite difficult to correctly code and debug. But while the small 
base case is the simplest and easiest way to code the computation, it has a 
significant negative effect on the performance — the procedure spends most of 
its time dividing the problem into subproblems. The overhead of control flow, 
consisting of procedure calls and testing for the base case condition, overwhelms 
the useful computation. For each instruction that increments an array element, 
the computation executes at least one conditional instruction and one procedure 
call. To improve the efficiency of the program, the compiler has to reduce the 
control flow overhead. 

The compiler can achieve control flow elimination in two ways. Procedure in- 
lining can eliminate procedure call overhead. Fusing conditional statements with 
equivalent conditional expressions can eliminate redundant conditional state- 
ments. Our compiler applies both kinds of optimizations. 

2.1 Inlining Recursive Procedures 

The compiler first inlines the two recursive calls to procedure dcInc. Figure 2
shows the result of inlining these recursive calls. For this transformation, the
compiler starts with two recursive copies of the original procedure dcInc,
replaces their recursive calls with mutually recursive calls from one copy to the
other, and then inlines one of them into the other. The resulting recursive
procedure dcIncI has two base cases: the original base case for n = 1 and a larger
base case for n/2 = 1 (textually, this larger base case is split into two pieces in
the generated code).

void dcIncI(int *p, int n) {
  if (n == 1) {
    *p += 1;
  } else {
    if (n/2 == 1) {
      *p += 1;
    } else {
      dcIncI(p, n/2/2);
      dcIncI(p+n/2/2, n/2/2);
    }
    if (n/2 == 1) {
      *(p+n/2) += 1;
    } else {
      dcIncI(p+n/2, n/2/2);
      dcIncI(p+n/2+n/2/2, n/2/2);
    }
  }
}

Fig. 2. Program after inlining recursion

void dcIncF(int *p, int n) {
  if (n == 1) {
    *p += 1;
  } else {
    if (n/2 == 1) {
      *p += 1;
      *(p+n/2) += 1;
    } else {
      dcIncF(p, n/2/2);
      dcIncF(p+n/2/2, n/2/2);
      dcIncF(p+n/2, n/2/2);
      dcIncF(p+n/2+n/2/2, n/2/2);
    }
  }
}

Fig. 3. Program after conditional fusion

Compared to the original recursive procedure dcInc from Figure 1, the inlined
procedure dcIncI divides the given problem into a larger number of smaller
subproblems. The inlined procedure generates four subproblems of quarter size,
while the original procedure generates only two subproblems of half size. The
transformation therefore eliminates half of the procedure calls in the
dynamic call tree of the program.
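
The effect on the number of calls can be checked with a small counting sketch.
The helpers calls_dcInc and calls_dcIncI are hypothetical names introduced here;
they mirror only the call structure of the two procedures, assuming that the
original dcInc recurses on two halves of size n/2 and that n is a power of two.

```c
#include <assert.h>

/* Number of procedure invocations made by the original dcInc for a
   problem of size n (assumed structure: two recursive calls on n/2). */
long calls_dcInc(int n) {
    assert(n >= 1);
    if (n == 1) return 1;                    /* base case: one invocation */
    return 1 + 2 * calls_dcInc(n / 2);       /* this call plus two half-size calls */
}

/* Same count for the inlined procedure dcIncI of Figure 2. */
long calls_dcIncI(int n) {
    assert(n >= 1);
    if (n == 1 || n / 2 == 1) return 1;      /* both base cases make no further calls */
    return 1 + 4 * calls_dcIncI(n / 2 / 2);  /* this call plus four quarter-size calls */
}
```

For n = 16 the two helpers return 31 and 21: the invocations for subproblems of
size 8 and size 2 no longer appear in the call tree of the inlined version.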



2.2 Conditional Fusion 

The inlined code in dcIncI also contains more conditionals and basic blocks than
the original recursive code. Since inlining has exposed more code in the body of
the procedure, the compiler can now perform intra-procedural transformations
to simplify the control flow of the procedure body. In Figure 2 the compiler
recognizes that the two if statements have identical test conditions n/2 == 1. It
therefore applies another transformation, called conditional fusion, to replace the
two conditional statements with a single conditional statement. Figure 3 presents
the resulting recursive procedure dcIncF after this transformation. The true
branch of the new conditional statement is the concatenation of the true branches
of the initial if statements. Similarly, the false branch of the new conditional
statement is the concatenation of the false branches of the initial if statements.
The test condition of the merged conditional statement is the common test
condition of the initial if statements.

void dcInc2(int *p, int n) {
  if (n == 1) {
    *p += 1;
  } else {
    if (n/2 == 1) {
      *p += 1;
      *(p+n/2) += 1;
    } else {
      if (n/2/2 == 1) {
        *p += 1;
        *(p+n/2/2) += 1;
        *(p+n/2) += 1;
        *(p+n/2+n/2/2) += 1;
      } else {
        dcInc2(p, n/2/2/2);
        dcInc2(p+n/2/2/2, n/2/2/2);
        dcInc2(p+n/2/2, n/2/2/2);
        dcInc2(p+n/2/2+n/2/2/2, n/2/2/2);
        dcInc2(p+n/2, n/2/2/2);
        dcInc2(p+n/2+n/2/2/2, n/2/2/2);
        dcInc2(p+n/2+n/2/2, n/2/2/2);
        dcInc2(p+n/2+n/2/2+n/2/2/2, n/2/2/2);
      }
    }
  }
}

Fig. 4. Program after second unrolling iteration



2.3 Unrolling Iterations 

Because of the recursive structure of the program, the above transformations
can be repeatedly applied. The recursive program dcIncF in Figure 3 represents
the program after the first unrolling iteration. It performs the same overall
computation as the original program dcInc, but it has a different internal structure.

The compiler can now use dcIncF and dcInc to unroll the recursion further.
Since the two procedures perform the same computation, the compiler can safely
replace their recursive calls with mutually recursive calls between each other and
then inline one of them into the other. The compiler can further apply conditional
fusion on the resulting recursive procedure. It thus produces the result of the
second unrolling iteration, the recursive procedure dcInc2 shown in Figure 4. It
has a bigger base case for n/2/2 == 1 and its recursive part divides the problem
into an even larger number of smaller subproblems than dcIncF.
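
The claim that an unrolled version performs the same computation as the original
can be spot-checked by running both on the same input. The sketch below
assumes the original dcInc has the two-call structure described in the text;
same_result is a hypothetical test helper, and the check is only meant for
power-of-two sizes up to 64.

```c
#include <assert.h>
#include <string.h>

/* Original procedure, as assumed from the description in the text. */
void dcInc(int *p, int n) {
    if (n == 1) {
        *p += 1;
    } else {
        dcInc(p, n/2);
        dcInc(p+n/2, n/2);
    }
}

/* First unrolling iteration with conditional fusion (Figure 3). */
void dcIncF(int *p, int n) {
    if (n == 1) {
        *p += 1;
    } else if (n/2 == 1) {
        *p += 1;
        *(p+n/2) += 1;
    } else {
        dcIncF(p, n/2/2);
        dcIncF(p+n/2/2, n/2/2);
        dcIncF(p+n/2, n/2/2);
        dcIncF(p+n/2+n/2/2, n/2/2);
    }
}

/* Returns 1 if both versions leave identical arrays when started from a
   zero-initialised input of size n (n a power of two, n <= 64). */
int same_result(int n) {
    int a[64], b[64];
    assert(n >= 1 && n <= 64);
    memset(a, 0, sizeof a);
    memset(b, 0, sizeof b);
    dcInc(a, n);
    dcIncF(b, n);
    return memcmp(a, b, sizeof a) == 0;
}
```

The same comparison could be repeated for dcInc2 by adding its body from
Figure 4 next to dcIncF.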

The recursion unrolling process can continue now by transforming recursive
procedures dcInc and dcInc2 into mutually recursive procedures, and then
applying the above transformations. The unrolling process stops when the number
of iterations reaches the desired unrolling factor.

void dcIncR(int *p, int n) {
  if (n == 1) {
    *p += 1;
  } else {
    if (n/2 == 1) {
      *p += 1;
      *(p+n/2) += 1;
    } else {
      if (n/2/2 == 1) {
        *p += 1;
        *(p+n/2/2) += 1;
        *(p+n/2) += 1;
        *(p+n/2+n/2/2) += 1;
      } else {
        dcIncR(p, n/2);
        dcIncR(p+n/2, n/2);
      }
    }
  }
}

Fig. 5. Program after re-rolling



2.4 Re-rolling Recursion 

Inlining recursive procedures automatically unrolls both the base case and the
recursive part. Depending on the input problem size, the unrolled recursive part
may lead to small base case subproblems that do not exercise the bigger, unrolled
base cases. For instance, for procedure dcInc2, if the initial problem size is
n = 8, the recursive calls will divide the problem into subproblems of size n = 1.
Therefore, the bigger base case for n == 4 does not get executed.
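
The following sketch traces which base case is reached for a given problem
size. The helpers base_dcInc2 and base_dcIncR are hypothetical; they follow
only the control structure of dcInc2 (Figure 4) and of the re-rolled procedure
dcIncR (Figure 5), and they assume n is a power of two so that all subproblems
at one level have the same size.

```c
#include <assert.h>

/* Size of the base case that dcInc2 ends up executing for size n. */
int base_dcInc2(int n) {
    assert(n >= 1);
    if (n == 1) return 1;
    if (n/2 == 1) return 2;
    if (n/2/2 == 1) return 4;
    return base_dcInc2(n/2/2/2);   /* unrolled recursive part: subproblems of size n/8 */
}

/* Same question for dcIncR, whose recursive part is rolled back. */
int base_dcIncR(int n) {
    assert(n >= 1);
    if (n == 1) return 1;
    if (n/2 == 1) return 2;
    if (n/2/2 == 1) return 4;
    return base_dcIncR(n/2);       /* re-rolled recursive part: subproblems of size n/2 */
}
```

For n = 8, base_dcInc2 returns 1 while base_dcIncR returns 4.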

Since most of the time is spent at the bottom of the recursion tree, the goal
of the compiler is to ensure that the bigger base cases are always executed. To
achieve this goal, the compiler applies a final transformation, called recursion
re-rolling, which rolls back the recursive part of the unrolled procedure. The
result of re-rolling procedure dcInc2 is the procedure dcIncR shown in Figure 5.
The compiler detects that the recursive part of the initial procedure dcInc is
executed on a condition which is always implied by the condition on which the
recursive part of the unrolled procedure dcInc2 is executed. The compiler can
therefore safely replace the recursive part of dcInc2 with the recursive part of
dcInc, thus rolling back only the recursive part of the unrolled procedure. Thus,
the recursive part of the procedure is unrolled only temporarily, to generate the
base cases. After the large base cases are generated, the recursive part is rolled
back.

Algorithm RecursionUnrolling ( Proc f, Int m )

  $f^{(0)}_{unroll}$ = clone(f);

  for (i = 1; i <= m; i++)
    $f^{(i)}_{unroll}$ = RecursionInline($f^{(i-1)}_{unroll}$, f);
    $f^{(i)}_{unroll}$ = ConditionalFusion($f^{(i)}_{unroll}$);

  return RecursionReRoll($f^{(m)}_{unroll}$, f)

Fig. 6. Top Level of the recursion unrolling algorithm
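
The implication test behind re-rolling can be illustrated on the running
example. The sketch below is a brute-force check over a range of sizes, not the
symbolic reasoning a compiler would use; the function names are hypothetical,
and the condition n != 1 for the original dcInc is assumed from its base case
test n == 1.

```c
#include <assert.h>

/* Condition under which the unrolled recursive part of dcInc2 runs (Figure 4). */
int cond_unrolled(int n) { return n != 1 && n/2 != 1 && n/2/2 != 1; }

/* Condition under which the recursive part of the original dcInc runs. */
int cond_rolled(int n) { return n != 1; }

/* Checks "cond_unrolled implies cond_rolled" for every n in [lo, hi]. */
int implies_on_range(int lo, int hi) {
    assert(lo <= hi);
    for (int n = lo; n <= hi; n++)
        if (cond_unrolled(n) && !cond_rolled(n)) return 0;
    return 1;
}
```

The converse does not hold: for n = 4 the rolled condition is true but the
unrolled one is false, which is exactly the case handled by the larger base case.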



3 Algorithms 

This section presents in detail the algorithms that enable the compiler to perform 
the transformations presented in the previous section. 

3.1 Top Level Algorithm 

Figure 6 presents the top level algorithm for recursion unrolling. The algorithm
takes two parameters: a recursive procedure f to unroll, and an unrolling factor
m. The algorithm will unroll f m times. The algorithm iteratively builds a set
$S = \{ f^{(i)}_{unroll} \mid 0 \le i \le m \}$ of unrolled versions of the given
recursive procedure. Different versions have base cases of different sizes. The
internal structure of different versions is therefore different, but all versions
perform the same computation.

The algorithm starts with a copy of the procedure f. This is the version of
f unrolled zero times, $f^{(0)}_{unroll}$. Then, at each iteration i, the algorithm
uses the version $f^{(i-1)}_{unroll}$ created in the previous iteration to build a
new unrolled version $f^{(i)}_{unroll}$ of f with a bigger base case. To create the
new version, the compiler inlines the original procedure f into the version from
the previous iteration. The recursion inlining algorithm performs the inlining of
recursive procedures. It takes two recursive versions from the set S and inlines
one into another. After inlining, the compiler applies conditional fusion to
simplify the control flow and coalesce conditional statements in the new recursive
version $f^{(i)}_{unroll}$.

After it executes m iterations, the compiler stops the unrolling process. The
last unrolled version $f^{(m)}_{unroll}$ has the biggest base case and the biggest
recursive part. The compiler finally applies recursion re-rolling to roll back the
recursive part of $f^{(m)}_{unroll}$.

Algorithm RecursionInline ( Proc f1, Proc f2 )

  Proc f3 = clone(f1);
  Proc f4 = clone(f2);

  foreach cstat in CallStatements(f3, f3) do
    replace callee f3 in cstat with f4

  foreach cstat in CallStatements(f4, f4) do
    replace callee f4 in cstat with f3

  foreach cstat in CallStatements(f3, f4) do
    replace cstat with inlined procedure f4

  return f3

Fig. 7. Recursion inlining algorithm



3.2 Recursion Inlining 

The recursion inlining algorithm takes two recursive procedures, f1 and f2, which
perform the same overall computation, and inlines one of them into the other.
The result of this transformation is a recursive procedure with a base case bigger
than any of the base cases of procedures f1 and f2.

Figure 7 presents the recursion inlining algorithm. Here, CallStatements(f, g)
represents the set of procedure call statements with caller f and callee g. The
compiler first creates two copies f3 and f4 of the parameter procedures f1 and
f2, respectively. It then replaces each recursive call in f3 and f4 with calls to the
other procedure. Because f1 and f2 perform the same computation, each of the
new mutually recursive procedures f3 and f4 will perform the same computation
as the original procedures f1 and f2. With direct recursion translated into
mutual recursion, each call statement has a different caller and callee. This enables
procedure inlining at the mutually recursive call sites. The compiler therefore
inlines the procedure f4 into f3. The resulting inlined version of f3 is a recursive
procedure which performs the same computation as the given procedures f1 and
f2, but has a bigger base case and splits a given problem into a larger number
of smaller subproblems.
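
The intermediate step, before any inlining takes place, can be written out for
the running example. In this sketch f3 and f4 are both clones of the assumed
original dcInc (so f1 and f2 are the same procedure), and all_incremented_once
is a hypothetical helper that checks the result for power-of-two sizes up to 64.

```c
#include <assert.h>
#include <string.h>

void f4(int *p, int n);   /* forward declaration needed for mutual recursion */

/* f3: clone of dcInc whose recursive calls now target f4. */
void f3(int *p, int n) {
    if (n == 1) {
        *p += 1;
    } else {
        f4(p, n/2);
        f4(p+n/2, n/2);
    }
}

/* f4: second clone, with its recursive calls redirected to f3. */
void f4(int *p, int n) {
    if (n == 1) {
        *p += 1;
    } else {
        f3(p, n/2);
        f3(p+n/2, n/2);
    }
}

/* Runs f3 on a zeroed array of size n and returns 1 if every element
   was incremented exactly once. */
int all_incremented_once(int n) {
    int a[64];
    assert(n >= 1 && n <= 64);
    memset(a, 0, sizeof a);
    f3(a, n);
    for (int i = 0; i < n; i++)
        if (a[i] != 1) return 0;
    return 1;
}
```

Replacing the two calls to f4 inside f3 with the body of f4 yields the
procedure shown in Figure 2.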

3.3 Conditional Fusion 

Conditional fusion is an intra-procedural transformation that merges conditional
if statements with equivalent condition expressions. The conditional fusion
algorithm searches the control flow graph of the unrolled procedure for consecutive
conditional statements with this property.

Algorithm ConditionalFusion ( Proc f )

  foreach meta-basic-block B
      in bottom-up traversal of f do

    Boolean failed = false
    Statement newcond

    foreach meta-statement stat in B do
      if ( not IsConditional(stat) ) then
        failed = true
        break
      else if ( IsEmpty(newcond) )
        newcond = clone(stat)
      else if ( not SameCondition(newcond, stat) )
        failed = true
        break
      else
        Append(newcond.True, stat.True);
        Append(newcond.False, stat.False);

    if (not failed) then
      replace B with newcond

  return f

Fig. 8. Conditional fusion algorithm



For detecting such patterns, a hierarchically structured control flow graph 
is more appropriate. A hierarchical control flow graph is a graph of
meta-basic-blocks. A meta-basic-block is a sequence of meta-statements. A meta-statement
is either a program instruction, a conditional statement, or a loop statement. 
There is no program instruction that jumps in or out of a meta-basic-block. 
Bodies of loop statements and branches of conditional statements are, in turn, 
hierarchical control flow graphs. 

Algorithm RecursionReRoll ( Proc f1, Proc f2 )

  MetaBasicBlock B1 = RecursivePart(f1)
  MetaBasicBlock B2 = RecursivePart(f2)

  Boolean cond1 = RecursionCondition(f1)
  Boolean cond2 = RecursionCondition(f2)

  if ( cond1 implies cond2 ) then
    replace calls to f2 in B2 with calls to f1
    replace B1 with B2 in procedure f1

  return f1

Fig. 9. Recursion re-rolling algorithm

Using a hierarchical control flow graph representation, the conditional fusion
algorithm is formulated as shown in Figure 8. The compiler traverses the
hierarchical control flow structure in a bottom-up fashion. At each level, it
inspects the meta-statements in the current meta-basic-block B. It checks if all the
meta-statements in B are conditional statements and if they all have equivalent
condition expressions. If not, the failed flag is set to true and no transformation
is performed. When checking the equivalence of condition expressions, the
compiler also verifies that the conditional statements do not write any of the
variables of the condition expressions. This ensures that condition expressions of
different if statements refer to variables with the same values. As it checks the
statements, the compiler starts building the merged if statement. If stat is the
current conditional statement, the compiler appends its true branch to the true
branch of the new conditional, and its false branch to the false branch of the
new conditional. After scanning the whole block, if the flag failed is not
set, the compiler replaces B with the newly constructed conditional statement.
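
The fusion of a single meta-basic-block can be sketched on a toy representation.
Everything in this block is an assumption made for illustration: the MetaStmt
type, the fixed branch capacity, and the function fuse_block are hypothetical;
conditions are compared textually; and the check that branches do not write
the condition variables, as well as the bottom-up traversal, are left out.

```c
#include <assert.h>
#include <string.h>

#define MAX_BRANCH 16   /* capacity of a branch in this toy representation */

/* Toy meta-statement: a plain instruction (is_conditional == 0) or an
   if statement whose branches are flat lists of statement texts. */
typedef struct {
    int is_conditional;
    const char *cond;                      /* condition text, e.g. "n/2 == 1" */
    const char *true_branch[MAX_BRANCH];
    int n_true;
    const char *false_branch[MAX_BRANCH];
    int n_false;
} MetaStmt;

/* Appends the first n entries of src to dst; returns 0 if dst would overflow. */
static int append(const char **dst, int *n_dst, const char *const *src, int n) {
    if (*n_dst + n > MAX_BRANCH) return 0;
    for (int i = 0; i < n; i++) dst[(*n_dst)++] = src[i];
    return 1;
}

/* Fuses all statements of one block into a single conditional.
   Returns 1 and fills *newcond on success; returns 0 (block unchanged) if
   some statement is not a conditional or the conditions differ. */
int fuse_block(const MetaStmt *block, int n, MetaStmt *newcond) {
    assert(block != NULL && newcond != NULL);
    if (n <= 0) return 0;
    memset(newcond, 0, sizeof *newcond);
    for (int i = 0; i < n; i++) {
        const MetaStmt *stat = &block[i];
        if (!stat->is_conditional) return 0;
        if (i == 0) {
            newcond->is_conditional = 1;       /* first conditional seeds the result */
            newcond->cond = stat->cond;
        } else if (strcmp(newcond->cond, stat->cond) != 0) {
            return 0;                          /* different test condition: give up */
        }
        if (!append(newcond->true_branch, &newcond->n_true,
                    stat->true_branch, stat->n_true)) return 0;
        if (!append(newcond->false_branch, &newcond->n_false,
                    stat->false_branch, stat->n_false)) return 0;
    }
    return 1;
}
```

Applied to the two if statements of Figure 2, which share the condition
n/2 == 1, the result has a true branch with the two increments and a false
branch with the four recursive calls, as in Figure 3.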

3.4 Recursion Re-rolling 

The recursion re-rolling transformation rolls back the recursive part of the un- 
rolled procedure, leaving the unrolled base case unchanged. It ensures that the 
largest unrolled base case is always executed, regardless of the input problem 
size. 

Figure 9 presents the algorithm for recursion re-rolling. The algorithm is given
two procedures which are versions of the same recursive computation. Procedure
f1 has an unrolled recursive part and procedure f2 has a rolled recursive part.
To re-roll the recursion of f1, the compiler first identifies the recursive parts
of the two procedures, B1 and B2 respectively. The recursive part of a procedure is
the smallest meta-basic-block in the procedure that contains all the recursive
calls and which represents the whole body of the procedure when executed. The
compiler then detects the conditions on which the recursive parts are executed.
If the condition cond1 on which the recursive part of the unrolled procedure is
executed always implies the condition cond2 on which the rolled recursion of f2
is executed, then the compiler performs the re-rolling transformation. Knowing
that both f1 and f2 perform the same computation, the compiler first replaces
calls to f2 in B2 with calls to f1. It then replaces block B1 with block B2 to
complete rolling back the recursive part of f1.

                                              Unrolling Types
Machine      Input Size      0u      1u    1u+f   1u+fr      2u    2u+f   2u+fr   3u+fr  Hand Coded
Pentium III         512    9.22    3.41    2.83    2.97   11.49    9.34    2.69    2.55        1.16
Pentium III        1024   73.80   69.43   64.75   23.61   32.51   24.63   20.70   20.47        9.19
PowerPC             512   14.35    4.19    3.28    2.89   17.32   25.19    1.63    1.33        0.59
PowerPC            1024  114.60  136.84  137.20   23.17   33.87   35.91   12.91   10.74        4.69
Origin 2000         512   30.57    9.92    6.77    6.84   30.54   29.81    3.95    3.62        1.24
Origin 2000        1024  244.44  239.64  230.43   54.59   81.13   58.55   31.46   28.73        9.90

Table 1. Running times of unrolled versions of Mul (seconds)

                                              Unrolling Types
Machine      Input Size      0u      1u    1u+f   1u+fr      2u    2u+f   2u+fr   3u+fr  Hand Coded
Pentium III         512    3.14    2.15    2.00    0.99    1.86    1.32    0.84    0.83        0.71
Pentium III        1024   24.77   14.62   13.14    8.53   21.25   15.86    6.99    6.58        5.48
PowerPC             512    4.88    4.15    4.01    1.16    2.41    2.25    0.73    0.66        0.70
PowerPC            1024   39.16   26.58   24.91    9.46   29.20   32.33    5.91    5.33        5.64
Origin 2000         512   10.77    7.34    6.56    2.42    5.06    3.31    1.39    1.26        1.20
Origin 2000        1024   85.86   48.47   40.98   19.20   54.56   44.53   10.96    9.95        9.57

Table 2. Running times of unrolled versions of LU (seconds)



4 Experimental Results 

We used the SUIF compiler infrastructure Q to implement the recursion un- 
rolling algorithms presented in this paper. We present experimental results for 
two divide and conquer programs: 

— Mul: Divide and conquer blocked matrix multiply. The program has one 
recursive procedure with 8 recursive calls and a base problem size of a matrix 
with one element. 

— LU: Divide and conquer LU decomposition. The program has four mutually 
recursive procedures. Each of them has a base problem size of a matrix with 
one element. The main recursive procedure has 8 recursive calls. 

We implemented our compiler as a source-to-source translator. It takes a C
program as input, locates the recursive procedures, then unrolls the recursion to
generate a new C program. We then compiled and ran the generated C programs
on three machines: a Pentium III machine running Linux, a PowerPC running
Linux, and an SGI Origin 2000 running IRIX.

Table 1 presents the running times for various versions of the Mul program.
Each column is labeled with the number of times that the compiler unrolled the
recursion; we report results for the computation unrolled 0, 1, 2, and 3 times.
If the column is labeled with an f, it indicates that the compiler applied the
conditional fusion transformation. If the column is labeled with an r, it indicates
that the compiler applied the recursion re-rolling transformation. So, for
example, the column labeled 1u+fr contains experimental results for the version with
the recursion unrolled once and with both conditional fusion and recursion
re-rolling. Depending on the architecture, the best automatically unrolled version
of program Mul performs between 3.6 and 10.8 times better than the unoptimized
version. Table 2 presents the running times for various versions of the LU
decomposition program. Depending on the architecture, the best automatically
unrolled version of this program performs between 3.8 and 8.6 times better than
the unoptimized version.

We also evaluate our transformations by comparing the performance of our
automatically generated code with that of several versions of the programs with
optimized, hand coded base cases. We obtained these versions from the Cilk
benchmark set. The last column in Tables 1 and 2 presents the running times
of the hand coded versions. The best automatically unrolled version of Mul
performs between 2.2 and 2.9 times worse than the hand optimized version. The
performance of the best automatically unrolled version of LU is basically comparable
to that of the hand coded version. These results show that our transformations
can generate programs whose performance is close to, and in some cases identical
to, the performance of programs with hand coded base cases.
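
The quoted improvement factors follow directly from the table rows. As a small
worked example, the hypothetical helper best_speedup below divides the running
time of the version unrolled 0 times by the smallest running time among the
automatically unrolled versions of the same row (the hand coded column is left
out).

```c
#include <assert.h>

/* times[0] is the running time of the unoptimized version; times[1..n-1]
   are the running times of the automatically unrolled versions. */
double best_speedup(const double *times, int n) {
    assert(times != 0 && n >= 2);
    double best = times[1];
    for (int i = 2; i < n; i++)
        if (times[i] < best) best = times[i];
    return times[0] / best;
}
```

For Mul, the Pentium III row with input size 512 gives 9.22 / 2.55, about 3.6,
and the PowerPC row with input size 512 gives 14.35 / 1.33, about 10.8.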

4.1 Impact of Re-rolling 

The running times in Tables 1 and 2 emphasize the impact of recursion re-rolling
on the performance of the program. Whenever the unrolled recursion makes
the program skip its largest unrolled base case, recursion re-rolling can deliver
substantial performance improvements. For instance, in the case of program
Mul with recursion unrolled twice, running on the Origin 2000, on a matrix of
512x512 elements, recursion re-rolling dramatically improves the performance:
with recursion re-rolling, the running time drops from 29.81 seconds to 3.95
seconds. For this example, recursion inlining and conditional fusion produce
additional base cases of sizes 2 and 4. But because the unrolled recursive part
divides each problem into several subproblems of 1/8 the size at each step, these
base cases never get executed. The program always executes the inefficient base
case of size 1. Re-rolling the recursion ensures that the efficient base case of
size 4 is always executed.

The structure of the inlined recursion also explains why programs whose
recursive part is not re-rolled may perform worse after several inlining steps. For
instance, the Mul program on a Pentium III has a surprisingly high running time
of 11.49 seconds after two unrolling iterations, compared to its fast execution of
3.41 seconds after a single unrolling iteration. The reason for this performance
difference is that, for the given problem size of 512, the problem is always reduced
to a base case of size 2 when recursion is unrolled once, while the program always
ends up executing the smaller and inefficient base case of size 1 when recursion
is unrolled twice.

Finally, our results for different problem sizes show that the impact of
re-rolling depends on the problem size. For LU running on the PowerPC, with
recursion unrolled twice, the version with re-rolling runs 3.08 times faster than
the version without re-rolling for a matrix of 512x512 elements, and 5.47 times
faster for a matrix of 1024x1024 elements. The fact that the size
of the base case that gets executed before re-rolling depends on the problem size
explains this discrepancy.
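
The dependence on the problem size can be reproduced with a simple model. The
sketch assumes a single recursive procedure that halves its problem size at
each original recursion level and has been unrolled u times, so that it has base
cases for sizes 1, 2, ..., 2^u and an unrolled recursive part that divides the
size by 2^(u+1). The function names are hypothetical and n is assumed to be a
power of two.

```c
#include <assert.h>

/* Base case size reached when the recursive part stays unrolled. */
int base_without_rerolling(int n, int u) {
    assert(n >= 1 && u >= 0 && u < 30);
    int largest_base = 1 << u;
    while (n > largest_base)
        n >>= (u + 1);          /* unrolled recursive part: size / 2^(u+1) */
    return n;
}

/* Base case size reached after the recursive part is rolled back. */
int base_with_rerolling(int n, int u) {
    assert(n >= 1 && u >= 0 && u < 30);
    int largest_base = 1 << u;
    while (n > largest_base)
        n >>= 1;                /* re-rolled recursive part: size / 2 */
    return n;
}
```

Under this model, size 512 reaches the base case of size 2 when unrolled once
and of size 1 when unrolled twice, as described above for Mul, while size 1024
reaches sizes 1 and 2 respectively. With re-rolling, both sizes reach the
largest base case.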

4.2 Impact of Conditional Fusion 

Our results show that conditional fusion can achieve speedups of up to 1.5 over 
the versions without conditional fusion, as in the case of the Mul program run- 
ning on the SGI Origin 2000, on a matrix of 512x512 elements, with recursion 
unrolled once. In the majority of cases, fusion of conditionals improves program 
performance. In a few cases, though, the modified cache behavior after condi- 
tional restructuring causes a slight degradation in the performance. 

The major advantage of the conditional fusion transformation is that it enables
recursion re-rolling, which has significant positive impact on the running times. 
To apply recursion re-rolling, the compiler has to identify and separate the re- 
cursive parts and the base cases. Conditional fusion is the key transformation 
that enables the compiler to identify these parts in the unrolled code. 

5 Related Work 

Procedure inlining is a classical compiler optimization 0 |21 S 0 ^1 13 HT] . 
The usual goal is to eliminate procedure call and return overhead and to enable 
further optimizations by exposing the combined code of the caller and callee 
to the intraprocedural optimizer. Some researchers have reported a variety of 
performance improvements from procedure inlining; others have reported that 
procedure inlining has relatively little impact on the performance. 

Our initial recursion unrolling transformation is essentially procedure in- 
lining. We augment this transformation with two additional transformations, 
conditional fusion and recursion re-rolling, that significantly improve the perfor- 
mance of our target class of divide and conquer programs. We therefore obtain 
the benefit of a reduction in procedure call and return overhead. We also ob- 
tain more efficient code that eliminates redundant conditionals and sets up the 
recursion so as to execute the efficient large base case most of the time. 

In general, we report much larger performance increases than other resear- 
chers. We attribute these results to several factors. First, we applied our tech- 
niques to programs that heavily use recursion and therefore suffer from signifi- 
cant overheads that recursion unrolling can eliminate. Second, conditional fusion 
and recursion re-rolling go beyond the standard procedure inlining
transformation to further optimize the code.

6 Conclusion

This paper presents recursion unrolling, a technique for improving the perfor- 
mance of recursive computations. Like loop unrolling, recursion unrolling reduces 
control flow overhead and increases optimization opportunities by generating 
larger basic blocks. But recursion unrolling is somewhat more complicated than 
loop unrolling. The basic form of recursion unrolling reduces procedure call over- 
heads such as saving registers and stack manipulation. We have developed two 
transformations that optimize the code further. Conditional fusion merges con- 
ditionals with identical expressions, considerably simplifying the control flow in 
unrolled procedures. Recursion re-rolling rolls back the recursive part of the
procedure to ensure that the biggest unrolled base case is always executed,
regardless of the problem size.



Abstract. This paper describes an empirical study of selective optimization
using the Jalapeno Java virtual machine. The goal of the study
is to provide insight into the design and implementation of an adaptive 
system by investigating the performance potential of selective optimiza- 
tion and identifying the classes of applications for which this performance 
can be expected. Two types of offline profiling information are used to 
guide selective optimization, and several strategies for selecting the meth- 
ods to optimize are compared. 

The results show that selective optimization can offer substantial im- 
provement over an optimize-all-methods strategy for short-running ap- 
plications, and for longer-running applications there is a significant range 
of methods that can be selectively optimized to achieve close to opti- 
mal performance. The results also show that a coarse-grained sampling 
system can provide enough accuracy to successfully guide selective opti- 
mization. 



1 Introduction 

One technique for increasing the efficiency of Java applications is to compile 
the application to native code, thereby gaining efficiency over an interpreted 
environment. To satisfy Java’s dynamic semantics, such as dynamic class load- 
ing, most compilation-based systems perform compilation at runtime. However, 
any gains in application execution performance achieved by such JIT (“Just In 
Time”) systems must overcome the cost of application compilation. 

An alternative strategy is to selectively optimize the methods of an appli- 
cation that dominate the execution time with the goal of incurring the cost of 
optimization for only those methods in which the performance of compiled code 
will be most beneficial. One example of this approach is to use two compilers: 
a quick nonoptimizing compiler and an optimizing compiler. With this compile- 
only strategy, all methods are initially compiled by the nonoptimizing compiler 
and only selective methods are optimized. 

* Funded, in part, by IBM Research and NSF grant CCR-9808607. 

S.P. Midkiff et al. (Eds.): LCPC 2000, LNCS 2017, pp. 43-11771 2001. 

(c) Springer-Verlag Berlin Heidelberg 2001

The goal of this work is to empirically evaluate the limitations of selective
optimization. Although systems that selectively optimize methods during exe- 
cution do exist E2 El El 0 ES El the focus of such work is typically overall 
system performance, which is affected by many characteristics of the particular 
system being studied. These characteristics include 1) when a recompilation de- 
cision is made, 2) the overhead and online nature of the profiling information, 
3) the sophistication and overhead of the recompilation decision-making com- 
ponent, and 4) the efficiency and performance increase obtained by using the 
optimizing compiler. This study holds these factors constant to provide a com- 
prehensive evaluation of the effectiveness of selective optimization. Namely, 1) all 
optimization occurs before execution; 2) offline profiling is used; 3) optimization 
decisions are made before execution is begun; and 4) the Jalapeno optimizing
compiler H2| is used. 

In this work we 

— establish the performance gains possible for selective optimization and iden- 
tify classes of applications for which these gains can be expected; 

— define metrics that characterize the performance of selective optimization 
and report their values; 

— compare the performance of selective optimization when driven by two dif- 
ferent profiling techniques, time-based and sample-based; and 

— investigate different strategies for choosing methods to optimize. 

The results provide useful insight for the design and implementation of an 
adaptive system. We have observed that selective optimization can offer sub- 
stantial improvement over an all-opt (optimize all executing methods) strategy 
for our short-running benchmark suite, and for the longer-running suite there is 
a significant range of methods that can be selected to achieve close to optimal 
performance. The results also show that a coarse-grained sampling system can 
provide the accuracy necessary to successfully guide selective optimization, sug- 
gesting that the performance results may be obtainable with an adaptive system. 
Indeed, the results from this work were used during the design of the Jalapeno 
adaptive system 0. 

The rest of this paper elaborates on these points and is organized as follows. 
Sections 2 and 3 describe the background and methodology used in this study.
Section 4 presents the main results and direct conclusions. Section 5 investigates
the performance of several automatic method selection strategies. Section 6
describes related work. Section 7 draws conclusions and discusses these findings
with respect to adaptive optimization.



2 Background 

The results presented in this study were obtained using the Jalapeno JVM
being developed at IBM T.J. Watson Research Center. Except for a small amount
of code that manages operating system services, Jalapeno is written completely
in Java. In addition to eliminating language barriers between the application
and the JVM, writing a JVM in Java also enables optimization techniques to be
applied to the JVM itself, such as inlining JVM code directly into the application
and adaptively optimizing the JVM to the application.

An advantage of a compile-only approach is that optimized and unoptimized
code can be used together, without the overhead of entering and exiting
interpretation mode. Jalapeno currently contains two fully operational compilers, a
nonoptimizing baseline compiler and an optimizing compiler. The version of
the optimizing compiler used for this study performs the following optimizations:

— on-the-fly optimizations during bytecode to IR translation, such as copy 
propagation, constant propagation, dead code elimination, and register re- 
naming for local variables [SI]; 

— flow-insensitive optimizations, such as scalar replacement of aggregates and 
the elimination of local common subexpressions, redundant bounds checks 
(within an extended basic block jl4j l and redundant local exception checks; 

— semantic expansion transformations m of standard Java library classes and 
static inlining, including the inlining lock/unlock and allocation methods. 

These optimizations are collectively grouped together into the default optimiza- 
tion level (level 1). More sophisticated optimizations are under development. 

Both the nonoptimizing (baseline) and optimizing compilers compile one 
method at a time, producing native RS/6000 AIX machine code. On average 
the nonoptimizing compiler is 76 times faster than the optimizing compiler used 
in this study and the resulting performance benefit can vary greatly. For our 
benchmark suite, the execution time speedup ranges from 1.22-8.17 when the 
application is completely compiled by the optimizing compiler. 

Jalapeno provides a family of interchangeable memory managers |Sj with a 
concurrent object allocator and a stop-the-world, parallel, type-accurate garbage 
collector. This study uses the nongenerational copying memory manager. 



3 Methodology 

The experimental results were gathered on an IBM RS/6000 PowerPC 604e
with two 333MHz processors and 1048MB RAM, running AIX 4.3. Both
processors were free to be used during the execution of the benchmarks; however,
no background or parallel compilation is performed: all compilation occurs
pre-execution, and Jalapeno currently does not perform parallel compilation. The
study uses the SPECjvm98 benchmark suite, Jalapeno's optimizing compiler,
the pBOB multithreaded business transaction benchmark, and the
Volano benchmark, which is a multithreaded server application that simulates
chat rooms. The benchmarks range in cumulative class file sizes from 10,156
(209_db) to 1,516,932 (opt-cmp) bytes.

^ These results do not follow the official SPEC reporting rules, and therefore should 
not be treated as official SPEC results. 

The performance runs are grouped into two categories: short running and 
longer running. The short-running category includes the SPECjvm98 
benchmarks using the size 10 (medium) inputs and the Jalapeno optimizing 
compiler compiling a small subset of itself to give a comparable runtime. 
The pBOB and Volano benchmarks are designed to be longer-running server 
benchmarks and thus are not included in the short-running category. The 
longer-running category includes the SPECjvm98 benchmarks using the size 100 
(large) inputs, the Jalapeno optimizing compiler compiling a larger subset of 
itself, and the pBOB and Volano benchmarks. The pBOB and Volano benchmarks 
run for a fixed amount of time and report their performance as a throughput 
metric in events per second. To allow comparison, we report an equivalent 
metric of seconds per 750,000 transactions for pBOB and seconds per 140,000 
messages for Volano. 

Each experiment first compiles all application methods using the nonoptimiz- 
ing compiler (as is the case in a compile-only adaptive system) to ensure that 
all classes will be loaded when the optimizing compiler is invoked. This avoids 
(in optimized code) the performance penalties associated with dynamic linking, 
i.e., calls to methods of classes that have not been loaded, and also allows the 
called method to be a candidate for inlining. Next, a selected number of ap- 
plication methods are compiled again by the optimizing compiler. Finally, the 
application is executed without any further compilation being performed. Thus, 
the application’s execution time contains no compilation overhead. 

Each benchmark is executed repeatedly, with the number of methods 
optimized, N, increased on each run. These methods are chosen in order of 
hotness, based on a past profile of the unoptimized application with the same 
input. Thus, the number of methods that can be optimized is bounded by the 
number of profiled methods. For each benchmark we chose the following values 
of N: 0, 1, 2, . . . , 19, 20, 30, 40, . . . , # profiled methods. We use the 
term “all methods” to refer to all methods present in the profile. 
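
The schedule of N values described above can be sketched as follows. This is 
a minimal illustration, not code from the study: the function name 
`optimization_counts` is hypothetical, and the handling of a profile size 
that is not a multiple of 10 (the final point is simply the profile size) is 
an assumption. 

```python
def optimization_counts(num_profiled):
    """Return the values of N tried for one benchmark (hypothetical helper).

    The schedule is 0, 1, ..., 20, then 30, 40, ..., and finally the number
    of profiled methods itself.
    """
    # Every value from 0 through 20, capped at the size of the profile.
    counts = list(range(0, min(20, num_profiled) + 1))
    # Steps of 10 beyond 20, staying below the profile size.
    counts.extend(range(30, num_profiled, 10))
    # Assumption: the last point is always "all methods" in the profile.
    if counts[-1] != num_profiled:
        counts.append(num_profiled)
    return counts
```

For a profile of 45 methods this yields 0 through 20, then 30, 40, and 45. 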

For each value of N, three quantities are measured: nonoptimizing 
compilation time (of all methods), optimization time (of N methods), and 
execution time. We refer to the total of these three times as the combined 
time. All times reported are the minimum of 4 runs of the application for the 
given value of N. The startup time of the Jalapeno virtual machine is not 
counted in the timings because it would simply add a constant to all times. 
Optimizing startup time of a JVM is a separate research topic. 
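
As a small illustration of how the combined time is assembled, the sketch 
below sums the three measured quantities after taking the minimum over the 
repeated runs. The function names are hypothetical, and taking the minimum 
separately for each quantity (rather than for the summed time of each run) is 
an assumption; the text only states that reported times are the minimum of 4 
runs. 

```python
def reported_time(runs):
    """Return the reported value for one quantity: the minimum over its runs."""
    return min(runs)


def combined_time(baseline_compile_runs, optimize_runs, execute_runs):
    """Combined time = nonoptimizing compile + optimization + execution.

    Each argument is a list of measurements in seconds, one per run.
    Assumption: the minimum is taken per quantity before summing.
    """
    return (reported_time(baseline_compile_runs)
            + reported_time(optimize_runs)
            + reported_time(execute_runs))
```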



3.1 Profiling 

Two types of profiling information are used to determine the order in which 
methods are selected for optimization: time-based and sample-based. The time- 
based profile records time spent in each method, and is collected by instrument- 
ing all calls, returns, throws, and catches to check the current time and add it 
to the appropriate method. Methods are ordered based on the total time spent 
in the method. 

The sampling-based profiling technique samples the running application 
approximately once every 10 milliseconds when the application reaches 
predefined sample points. These points are located on entry to a method and 
on loop back edges. Samples taken on loop back edges are attributed to the 
method containing the loop. Samples taken on method entries are attributed to 
both the calling and called method. Methods are ordered based on the total 
number of samples attributed to them. This sampling technique is efficient 
(the overhead introduced does not stand out above the noise from one run to 
the next), and it is being used in the initial Jalapeno adaptive system. 

The method orderings from both profiles are likely to contain a certain de- 
gree of imprecision. Sampling every 10 milliseconds is fairly coarse-grained con- 
sidering the speed of the processor; thus the sampled profile will contain some 
imprecision, especially for short-running applications. The time-based profile is 
also imprecise because the instrumentation introduces overhead, increases code 
size, and disrupts the instruction cache. 

A major difference between the two profiles is that the time-based profile 
is complete, in the sense that all executing methods will appear in the 
profile. The sample-based profile is an incomplete, or partial, profile, 
because it is possible for a method to execute but never be sampled. 
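
The attribution rule for the sample-based profile can be sketched as below. 
This is an illustrative model only, not the Jalapeno implementation: the 
tuple encoding of a sample and the function names are assumptions made for 
the example. 

```python
from collections import Counter


def attribute_samples(samples):
    """Count samples per method.

    Each sample is a tuple (hypothetical encoding):
      ("backedge", method)       taken on a loop back edge
      ("entry", caller, callee)  taken on method entry
    """
    counts = Counter()
    for sample in samples:
        if sample[0] == "backedge":
            # Back-edge samples go to the method containing the loop.
            counts[sample[1]] += 1
        elif sample[0] == "entry":
            # Entry samples go to both the calling and the called method.
            counts[sample[1]] += 1
            counts[sample[2]] += 1
        else:
            raise ValueError("unknown sample kind: %r" % (sample[0],))
    return counts


def hotness_order(counts):
    """Order methods by total samples, hottest first (ties broken by name)."""
    return sorted(counts, key=lambda m: (-counts[m], m))
```

A method that runs but never coincides with a sample point at sampling time 
does not appear in `counts` at all, which is what makes this profile partial. 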

4 Results 

This section presents the main results of the study. Section 4.1 describes the 
results using the time-based profile. Section 4.2 describes the results using 
the sample-based profile and compares these results to those from Section 4.1. 



4.1 Time-Based Profile 

Each application in the two suites can be plotted on a graph where the 
x-axis measures the number of (hot) methods selected for optimization and the 
y-axis measures time in seconds. Fig. 1 shows representative graphs for the 
short-running and longer-running suites using the time-based profile. Each 
graph contains two curves. A solid curve connects points of the form 
(number-of-optimized-methods, combined time). A dashed curve connects points 
of the form (number-of-optimized-methods, execution time), i.e., the dashed 
curve does not include any compilation time. The time to compile all methods 
with the nonoptimizing compiler is included in the combined time (solid 
curve). This time ranged from 0.06 to 0.63 seconds, averaging 0.19 seconds. 

One observation to make is the execution time difference between the 
short-running and longer-running benchmark suites. With all methods 
optimized, the execution time alone (the rightmost point on the dashed curve) 
is in the range of 1.1-4.8 seconds for the short-running benchmarks, and 
16.6-70.6 seconds for the longer-running benchmarks. Thus it is not 
surprising that compilation time is a larger percentage of combined time for 
the short-running benchmarks than for the longer-running benchmarks. 

Footnote: Due to space constraints we are unable to show all graphs. 

Fig. 1. Graphs for representative short-running and longer-running 
benchmarks using the time-based profile (panels: “Short-running benchmarks 
with time-based profile” and “Longer-running benchmarks with time-based 
profile”). The x-axis is the number of methods that are optimized. The y-axis 
is time in seconds. The dashed curve connects points for execution time. The 
solid curve connects points for the combined time (execution + compilation). 
x-axis points are recorded for (0, 1, ..., 20, 30, 40, ..., # profiled 
methods). 

Another trend across the benchmarks is a steep initial drop in the curves, 
reaching a minimum rather quickly. After the combined time (solid curve) 
reaches a minimum, most of the size 10 graphs rise again, while the size 100 
graphs remain relatively flat. 

Table 1 describes characteristics of the best combined time (“Best”) using 
selective optimization, i.e., the lowest point on the solid curves, for all 
benchmarks, using the time-based profile. The first column of Table 1 lists 
the benchmark name. The remaining columns present the results for the two 
categories of benchmarks, four columns for the short-running programs and 
four columns for the longer-running programs. Within each category the first 
three columns report the characteristics used to achieve the best combined 
time (Best) using selective optimization. This could be no methods optimized, 
all methods optimized, or some methods optimized in order of hotness. The 
column marked “Number” reports the number of (hot) methods optimized. The 
column marked “Pct” gives the percentage of all methods in the application 
that are optimized. The column marked “Pct of Profile Covered” gives the 
percentage of the time-based profile that is covered by the selected methods. 
The following column gives the Best time in seconds. 

Table 1. Effectiveness of selective optimization using time-based profiling 
for both the short and longer-running benchmarks. Recall that pBOB and Volano 
are not included in the short-running suite because they are designed to be 
server benchmarks. 

                Short-running suite                  Longer-running suite
                Methods for Best  Pct of   Best      Methods for Best  Pct of   Best
                                  Profile  Combined                    Profile  Combined
Benchmark       Number   Pct      Covered  Time [s]  Number   Pct      Covered  Time [s]

201_compress    16       9%       99.9%    3.8       15       8%       99.9%    39.9
202_jess        7        1%       88.6%    1.7       40       8%       98.9%    31.3
209_db          4        2%       84.6%    1.5       9        5%       99.9%    70.9
213_javac       0        0%       0%       2.7       160      17%      92.9%    49.1
222_mpegaudio   50       14%      99.8%    4.8       100      29%      99.9%    30.0
227_mtrt        20       6%       86.3%    3.8       70       21%      99.2%    19.7
228_jack        3        1%       40.5%    5.8       13       3%       82.4%    44.3
opt-cmp         1        0%       11%      11.2      290      18%      95.6%    75.8
pBOB            —        —        —        —         200      33%      98.9%    54.8
Volano          —        —        —        —         13       2%       93.9%    38.6

Average         12.6     3.8%     63.8%    4.4       91.1     14.4%    96.1%    45.4

The number of methods optimized for Best varies greatly across the 
benchmarks, ranging from 0 to 290 methods (0-33% of all methods), 
demonstrating that the benchmarks have rather different execution patterns 
regarding method hotness. For example, 201_compress spends 99.9% of its 
execution time in just 9% of its methods. This is in contrast to the more 
object-oriented benchmarks, such as 213_javac, opt-cmp, and pBOB, which 
distribute their execution time more evenly among a much larger set of 
methods, thereby presenting a bigger challenge for selective optimization. 
For the short-running benchmarks, Best includes 0 and 1 methods for 213_javac 
and opt-cmp, respectively, covering only 0% and 11% of the time-based 
profile. For the longer-running benchmarks, Best selects the largest number 
of methods to optimize for the three object-oriented benchmarks. Yet despite 
the large number of methods optimized, these benchmarks are among the four 
smallest regarding percentage of profile covered. 

The bar charts in Figs. 2 and 3 compare the combined time (compilation + 
execution) performance of three interesting method selection strategies: no 
methods optimized (none-opt), all methods optimized (all-opt), and Best. 
Recall that Best, by definition, gives the best combined time, so it will 
always be better than or equal to the other two configurations. Below each 
benchmark name is the execution time only when all methods are optimized 
(corresponding to the right-most point on the dashed lines in Fig. 1), which 
we refer to as all-opt-exe. All bars are normalized to this value; thus, all 
three method selection strategies are shown relative to a system in which 
compile time is free and all methods run at peak (for this environment) 
performance. Given Java’s dynamic semantics, it does not appear that such a 
system, without any runtime compilation or interpretation, is possible. 




3.54 1.14 1.23 1.56 2.89 1.70 4.40 4.16 2.63 

Fig. 2. Short-running suite performance comparison. The height of each bar 
is the combined time normalized to all-opt-exe time. The numbers under the 
benchmark names are the all-opt-exe time in seconds. 



For the short-running benchmarks (Fig. 2), Best averages 67% slower than 
all-opt-exe. The all-opt scheme is the worst of the three selections for 6 of 
the 8 benchmarks. This is not surprising because the benchmarks do not run 
long enough to recover the cost of optimizing all methods. For 5 of 8 
benchmarks, the improvement of Best over the other two extremes is 
substantial. Averaged over all benchmarks, Best is 56% better than none-opt 
and 71% better than all-opt. 
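
The comparison of none-opt, all-opt, and Best relative to all-opt-exe can be 
sketched as follows. The function name and the input layout (a dictionary 
from N to combined time that includes N = 0 and the largest N) are 
assumptions for illustration, and the numbers in the usage note are made up 
rather than taken from the study. 

```python
def normalized_strategies(combined_by_n, all_opt_exe):
    """Normalize the three strategies to the all-opt-exe time.

    combined_by_n: dict mapping N (methods optimized) to combined time [s];
                   assumed to contain N = 0 and N = all profiled methods.
    all_opt_exe:   execution time alone with all methods optimized [s].
    """
    ns = sorted(combined_by_n)
    # Best is the N with the lowest combined time.
    best_n = min(ns, key=lambda n: combined_by_n[n])
    return {
        "none-opt": combined_by_n[ns[0]] / all_opt_exe,
        "all-opt": combined_by_n[ns[-1]] / all_opt_exe,
        "Best": combined_by_n[best_n] / all_opt_exe,
        "best_n": best_n,
    }
```

With hypothetical timings `{0: 6.0, 5: 3.0, 10: 4.0}` and an all-opt-exe time 
of 2.0 seconds, Best is reached at N = 5 with a normalized height of 1.5, 
i.e., 50% slower than all-opt-exe. 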

For the longer-running benchmarks (Fig. 3), Best averages only an additional 
13% in combined time compared to an all-opt-exe configuration. All-opt is 
much closer to Best for all cases, averaging 28% worse than all-opt-exe. 
Because of the benchmarks’ longer execution time, none-opt is the worst 
strategy in all cases. These results suggest that although an all-opt 
strategy is effective for longer-running applications, the optimization 
overhead is not insignificant. 

Footnote: An adaptive system can utilize feedback-directed optimizations, 
i.e., optimizations based on the application’s current execution profile, to 
potentially obtain performance beyond that of all-opt-exe. 

Fig. 3. Longer-running suite performance comparison. The height of each bar 
is the combined time normalized to all-opt-exe time. The numbers under the 
benchmark names are the all-opt-exe time in seconds: 

compress  jess   db     javac  mpegaudio  mtrt   jack   opt-cmp  pBOB   Volano  average
39.56     30.00  70.60  39.91  27.47      16.69  41.34  50.25    47.70  37.77   39.90



Sweet Spot The previous performance results for Best are based on a particular 
ordering of the methods using an offline time-based profile. However, one may 
wonder how closely a technique must estimate the set of actual hot methods 
that achieve Best to obtain reasonable performance. To address this question, 
we record the number of collected timings that are within N percent of the 
Best combined time. We refer to this number as being in the sweet spot for this 
benchmark. Table 2 provides this information for the following values of N: 
1, 5, 10, 25, and 50. The value is expressed as the percentage of all 
collected timings that are within N% of the Best combined time. Due to space 
considerations we present, for each sweet spot range, the average and range 
of percentages over the benchmarks in each suite. For example, the value of 
2.6% in the “1%” column states that, on average, 2.6% of the collected 
timings of the short-running benchmarks fell within 1% of the Best time. The 
next row, “1-5%”, states that 2.6% represents an average of values that 
ranged from 1% to 5%. 
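
The sweet spot percentage for one benchmark can be computed as in the sketch 
below. The function name is hypothetical and the timings in the usage note 
are invented for illustration; the only assumption about the metric itself is 
that the Best point counts as being within N% of Best. 

```python
def sweet_spot_percentage(combined_times, n_percent):
    """Percentage of collected timings within n_percent of the Best time.

    combined_times: combined time [s] for every x-axis point of a benchmark.
    n_percent:      sweet spot width, e.g. 1, 5, 10, 25, or 50.
    """
    best = min(combined_times)
    limit = best * (1 + n_percent / 100)
    # The Best point itself always satisfies the limit.
    within = sum(1 for t in combined_times if t <= limit)
    return 100.0 * within / len(combined_times)
```

For the hypothetical timings `[10.0, 10.05, 10.4, 12.0, 20.0]`, two of five 
points lie within 1% of Best and three of five lie within 5%. 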

This table quantifies properties that are visually present in the graphs of 
Fig. 1, as well as those graphs not shown. For example, the graph of the 
short-running 213_javac in Fig. 1 shows few points as low as the Best value 
at point (0, 2.39), i.e., optimizing no methods resulted in the best running 
time. In fact, for this benchmark only 1% of the x-axis data points are 
within 50% of the Best combined time. This is in contrast to the graph of the 
longer-running 222_mpegaudio, where 90% of the selections are within 5% of 
Best performance. Although the longer-running benchmarks, in general, are 
more forgiving than the short-running benchmarks, their 5% sweet spots do 
vary considerably, from 5% to 95%. 

Table 2. Percentage of x-axis data points that are within N% of Best combined 
time using a time-based profile for N = 1, 5, 10, 25, 50. 

                          Sweet Spot Percentage
Benchmark        Metric   1%      5%      10%     25%      50%

Short running    Avg      2.6%    3.9%    6.2%    13.0%    23.0%
Short running    Range    1-5%    1-11%   1-20%   1-34%    1-50%
Longer running   Avg      7.1%    33.1%   62.2%   89.4%    92.5%
Longer running   Range    1-16%   5-95%   10-95%  25-100%  50-100%

4.2 Sample-Based Profile 

As mentioned in Section 3.1, a sample-based profile was also used to examine 
the feasibility of selective optimization. The same experiments were 
repeated, except the sample-based profile was used to establish method 
hotness. Fig. 4 shows two example performance graphs using the sample-based 
profile. The curves start out similar to those from the time-based profile; 
however, the curves end much earlier because only a subset of the executing 
methods are sampled. 



Fig. 4. Graphs for representative short-running and longer-running benchmark 
suites for the sample-based profiles. 



Table 3 shows characteristics of the sample-based profile for all benchmarks. 
The second column shows the percentage of all executed methods that are 
sampled, ranging from 6.7% to 57.5% and averaging 24.6%. The remaining 
columns of Table 3 show characteristics of Best from the sample-based profile 
(sample-best), and compare it to Best from the time-based profile 
(time-best). The three columns marked “Num Methods for Best” give the number 
of methods that are optimized to achieve time-best and sample-best, 
respectively, and the number of methods optimized in both cases, i.e., the 
intersection of the methods of the two previous columns. For the 
short-running benchmarks, the methods selected for Best are similar for both 
profiles; however, for the longer-running benchmarks there is a fairly large 
variation, especially for 213_javac, opt-cmp, and pBOB. 

Table 3. Comparison of “Best” using the sample-based and time-based profile 

                % Methods  Num Methods for Best  Sampled    Sampled Best /
Benchmark       Sampled    Time   Sample  Both   Best Time  Time Best

Short running
201_compress    8.5%       16     10      10     3.8        1.00
202_jess        6.7%       7      4       4      1.7        1.00
209_db          9.9%       4      8       3      1.6        1.02
213_javac       13.6%      0      0       0      2.7        1.00
222_mpegaudio   15.5%      50     40      28     4.8        1.01
227_mtrt        26.0%      20     20      14     3.9        1.01
228_jack        17.7%      3      5       4      5.7        0.99
opt-cmp         29.7%      1      17      1      9.4        0.84
Average         15.9%      12.6   15.5    8.0    4.2        0.98

Longer running
201_compress    11.0%      15     11      10     39.9       1.00
202_jess        15.3%      40     40      36     31.4       1.00
209_db          8.8%       9      7       5      70.6       1.00
213_javac       41.9%      160    160     122    49.3       1.01
222_mpegaudio   27.2%      100    90      64     30.2       1.01
227_mtrt        34.8%      70     60      50     19.2       0.97
228_jack        33.4%      13     30      12     43.8       0.99
opt-cmp         57.5%      290    220     203    75.2       0.98
pBOB            46.2%      200    210     157    53.4       0.97
Volano          13.6%      13     12      2      39.0       1.01
Average         28.9%      91.0   88.0    67.9   45.2       0.99

The last two columns of Table 3 show the running time of sample-best and 
the ratio of sample-best to time-best. Sample-best shows almost no 
degradation relative to time-best, and in several cases is better than 
time-best. As mentioned in Section 3.1, sample-best can be better than 
time-best because instrumentation perturbs the code. Thus, it is possible for 
the hotness ordering generated by the sampled profile to be more accurate 
than that of the time-based profile. Furthermore, because all calls and 
returns are instrumented, the error is likely to be greatest for benchmarks 
that contain a large number of short-lived method calls. 



Footnote: The methodology used for selecting methods to optimize does not 
consider characteristics of the methods such as compile time or the 
performance improvement of optimization. Therefore, it is possible that 
either profiling technique could “get lucky” if it happens to bias small 
methods, or methods that optimize well. 

The most important observation to be made from this data is that the 
sample-based, incomplete profile achieves the same Best performance as the 
time-based profile independent of the running time of the application. The 
key is that the sample-based profile is accurate enough for the methods that 
matter. For the short-running applications there is not much time to 
optimize, so only the hottest few methods need to be detected by the 
sample-based profile. As the program runs longer and there is more time to 
optimize, the sample-based profile will have had enough time to reasonably 
order the needed number of methods. 

5 Using Profiling as a Predictor for Method Selection 

Section 4 determined the Best set of methods to optimize by observing the 
timings of many different executions. In an adaptive optimization system, 
this decision will be made by observing the profiling information and 
predicting which methods would be the best to optimize. This section 
investigates the effectiveness of some simple prediction strategies to 
determine if there exists a single strategy, for all benchmarks, that can 
approach the results of Best. The following strategies are considered: 

Num methods: Optimize the N hottest methods. 

Percent of methods: Optimize N% of all profiled application methods (in 
hotness order). 

Time/Samples in method: Optimize any method that is executed for more 
than N milliseconds or N samples. 

Time/Samples in method relative to hottest method: Optimize any method 
that is executed for more than N% of the time of the hottest method. 

Percent profile coverage: Optimize enough hot methods (in hotness order) 
to cover N% of the profiled execution. 

For each strategy a wide range of values for N were explored, with the goal 
of finding a value of N that performs well for all of the benchmarks of 
similar execution time. The value of N that resulted in the lowest 
performance degradation from Best, averaged over all benchmarks of similar 
size, is chosen. 

Table 4 summarizes the results for the short-running and longer-running 
suites, respectively, for both types of profiles. The first column lists the 
profile used and benchmark suite. The last five columns list the strategies 
as described above. For each pair of profile type and benchmark suite, there 
are three rows. The first row gives the value of N that resulted in the 
lowest average performance degradation above Best. Given this value of N, the 
second and third rows show the average percentage degradation from (worse 
than) Best, and the range for all benchmarks in that suite, respectively. For 
example, for the short-running benchmarks with the time-based profile, 
optimizing the 9 hottest methods resulted in an average degradation of 31.7% 
from Best for those benchmarks, with values ranging from 0.1% to 71.0%. 
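
The five prediction strategies of this section, and the choice of the 
threshold N by lowest average degradation from Best, can be sketched as 
follows. This is a minimal illustration under stated assumptions, not the 
code used in the study: all function and strategy names are hypothetical, a 
profile is modeled as a dictionary from method name to time or sample count, 
and rounding down in the percent-of-methods strategy is an assumption. 

```python
def select_methods(profile, strategy, n):
    """Return the methods a strategy would optimize, hottest first.

    profile: dict mapping method name to time (ms) or number of samples.
    """
    ordered = sorted(profile, key=lambda m: (-profile[m], m))
    if not ordered:
        return []
    if strategy == "num_methods":
        return ordered[:int(n)]
    if strategy == "percent_of_methods":
        # Assumption: a fractional method count is rounded down.
        return ordered[:int(len(ordered) * n / 100)]
    if strategy == "in_method":
        return [m for m in ordered if profile[m] > n]
    if strategy == "relative_to_hottest":
        threshold = profile[ordered[0]] * n / 100
        return [m for m in ordered if profile[m] > threshold]
    if strategy == "profile_coverage":
        target = sum(profile.values()) * n / 100
        chosen, covered = [], 0
        for m in ordered:
            if covered >= target:
                break
            chosen.append(m)
            covered += profile[m]
        return chosen
    raise ValueError("unknown strategy: %r" % (strategy,))


def degradation_percent(combined, best):
    """Percentage by which a combined time is worse than Best."""
    return 100.0 * (combined - best) / best


def pick_threshold(results, best):
    """Pick the N with the lowest average degradation over a suite.

    results: dict N -> {benchmark: combined time using that N}
    best:    dict benchmark -> Best combined time
    Returns (N, average degradation, (min degradation, max degradation)).
    """
    def per_benchmark(n):
        return [degradation_percent(results[n][b], best[b]) for b in best]

    def average(n):
        values = per_benchmark(n)
        return sum(values) / len(values)

    chosen = min(sorted(results), key=average)
    values = per_benchmark(chosen)
    return chosen, average(chosen), (min(values), max(values))
```

The three values returned by `pick_threshold` correspond to the three rows 
reported for each profile and suite pair: the chosen N, the average 
degradation, and its range. 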

Table 4. Summary of the effectiveness of predictor strategies 

Profile Used /                                        Time/       Time/Samples in
Benchmark                            Num      Pct of  Samples in  Method Relative to  Pct Profile
Suite           Metric               Methods  Methods Method      Hottest Method      Coverage

Time /          Optimal Values       9        5%      38810ms     4.7%                74.3%
Short running   Pct Decrease, Avg    31.7     49.1    27.2        41.5                53.8
                Pct Decrease, Range  0.1-71.0 1.8-116 0.0-73.5    0-130               0.2-135

Time /          Optimal Values       120      19%     253ms       0.05%               99.7%
Longer running  Pct Decrease, Avg    3.7      2.5     3.6         3.7                 3.7
                Pct Decrease, Range  0.2-6.6  0.2-11.9 1.5-17.2   0.1-6.7             0.2-7.3

Sample /        Optimal Values       10       11%     10          3.25%               70.25%
Short running   Pct Decrease, Avg    6.2      35.0    8.5         42.2                50.2
                Pct Decrease, Range  0.0-31   0.0-148 0.0-51.4    0.9-135             0.9-147

Sample /        Optimal Values       180      50%     9           0.10%               99.2%
Longer running  Pct Decrease, Avg    2.5      1.2     1.0         2.7                 2.3
                Pct Decrease, Range  0.0-7.7  0.0-2.5 0.0-2.2     0.0-7.4             0.1-9.7

The first observation is that Best is harder to predict for the short-running 
benchmarks (first and third group of rows) than for the longer-running 
benchmarks (second and fourth group of rows), as can be seen by comparing the 
rows labeled “Avg”. The average degradation from Best for the short-running 
suite varies across the different prediction strategies from 27.2% to 53.8% 
for the time-based profile and from 6.2% to 50.2% for the sample-based 
profile. For the longer-running benchmarks, the worst average degradation 
from both profiling techniques is 3.7%. This extreme difference between the 
short- and longer-running suites is most likely not because the location of 
Best is inherently more difficult to predict for short-running applications, 
but because the sweet spot of the longer-running applications is much more 
forgiving. 

“Time/Samples in method” is the most effective overall strategy; its average 
degradation from Best is either the smallest, or close to the smallest, for all four 
profile/benchmark pairs. One explanation is that with short-running applications 
this metric limits the number of methods that are candidates for optimization 
because fewer methods exceed the specified threshold. This implicitly limits the 
compilation overhead in short-running applications, but allows a larger set of 
methods to be optimized in longer-running applications. 

Interestingly, the sample-based strategies are better than the time-based 
strategies (having a smaller average degradation from Best) for all but one 
case. This phenomenon could be partly due to inaccuracies in the time-based 
profile; however, it could also be more fundamental, in that method selection 
is inherently less error prone with a sample-based profile. Because the 
profile does not contain all executed methods, it automatically limits the 
selection to those methods that are sampled at least once, essentially adding 
a secondary condition to the prediction strategy. 

In particular, “Samples in method” is the best overall strategy, performing 
very well for both the short- and longer-running benchmarks, and having its 
optimal values for both categories be very close (9 and 10). This is in contrast 
to all other short/longer strategy pairs, where the optimal values vary greatly. 
This result echoes the performance results obtained with an early version of 
the sampling-based Jalapeno adaptive system, which used an online version of 
“Samples in method” to guide optimization decisions. 



6 Related Work 

Radhakrishnan et al. used the Kaffe Virtual Machine to measure the perfor- 
mance improvement possible by interpreting, rather than compiling, cold meth- 
ods for a subset of the SPECjvm98 benchmarks with input size 1. Although the 
two studies touch on similar ideas, Radhakrishnan et al. focused on architectural 
issues for Java, while our study emphasized a comprehensive evaluation of selec- 
tive optimization designed to aid the design and implementation of an adaptive 
system. 

Kistler presents a continuous program optimization architecture for 
Oberon that allows “up-to-the-minute” profile information to be used in pro- 
gram reoptimization. He concludes that continuous optimization can produce 
better code than offline compilation, even when past profiling information is 
used. Optimizations are evaluated by computing ideal performance speedup, 
which does not include profile or optimization time. This speedup is used to 
compute the break-even point: the length of time the application would have to 
execute to compensate for the time spent profiling and optimizing. 

The HotSpot JVM [22], the IBM Java Just-in-Time compiler (version 3.0), the 
Intel JUDO system, and the Jalapeno JVM are adaptive systems for Java. The 
first three systems initially interpret an application and later compile 
performance-critical code. The IBM JIT and Intel JUDO systems use method 
invocation counters, augmented to accommodate loops in methods, to trigger 
compilation. JUDO optionally uses a thread-based technique in which a 
separate thread periodically inspects the counters and performs compilation. 
Jalapeno uses a compile-only approach based on the same low-overhead sampling 
that is used in this paper. It employs a cost/benefit model to trigger 
recompilation. Details of the HotSpot compilation system are not provided. 

Hölzle and Ungar describe the SELF-93 system, an adaptive optimization 
system for SELF. Method invocation counters with an exponential decay 
mechanism are used to determine candidates for optimization. Unlike the 
results of our study, they conclude that determining when to optimize a 
method is more important than determining what to optimize. We attribute this 
contrasting conclusion to the differing goals of the two systems. The goal of 
their system is to avoid long pauses in interactive applications, whereas we 
focus on achieving optimal performance for short- and longer-running 
noninteractive applications. 

Footnote: The sampled profile does not add a secondary condition to 
“Time/Samples in method” because it already limits the selection in exactly 
the same manner. 

Detlefs and Agesen present two studies of selective optimization using the 
SPECjvm98 benchmarks, evaluating both offline and online strategies 
for selecting methods to optimize. Both studies utilize three possible modes of 
execution: an interpreter, a fast JIT, and an expensive optimizing compiler. 
Their studies confirm the performance benefits of selective optimization, and 
offer evidence that using all three modes is an effective approach. 

Plezbert and Cytron propose continuous compilation, as well as several 
other selective optimization strategies, and simulate their effectiveness on C 
programs. 

Bala et al. describe Dynamo, a transparent dynamic optimizer that per- 
forms optimizations on a native binary at runtime. Dynamo initially interprets 
the program, keeping counters to identify sections of code called hot traces. 
These sections are then optimized and written as an executable. The authors 
report an average speedup of 9%. 

Burger and Dybvig explore profile-driven dynamic recompilation in 
Scheme, with an implementation of a basic block reordering optimization. They 
report a 6% reduction in runtime and an average recompile time of 12% of their 
base compile time. 

Hansen describes an adaptive FORTRAN system that makes automatic 
optimization decisions. When a basic block counter reaches a threshold, the 
basic block is reoptimized and moved to the next optimization state, where more 
aggressive optimizations are performed. 

Another area of research considers performing runtime optimizations that 
exploit invariant runtime values. The main disadvantage of these techniques 
is that they rely on programmer directives to identify the regions of code to 
be optimized. There are also nonadaptive systems that perform 
compilation/optimization at runtime to avoid the cost of interpretation. This 
includes early work such as the Smalltalk-80 and Self-91 systems, as well as 
many of today’s JIT Java compilers. 

There are also fully automated profiling systems that use transparent profil- 
ing to improve performance of future executions. Such systems include Digital 
FX!32, Morph [57], and DCPI [5]. However, these systems do not perform 
compilation/optimization at runtime. 

7 Conclusions 

This paper describes an empirical study of selective optimization using the 
Jalapeno JVM. Two categories of benchmarks, short running (1.1-4.8 seconds) 
and longer running (16.6-70.6 seconds), are studied using SPECjvm98, pBOB, 
Volano, and the Jalapeno optimizing compiler as benchmarks. The main 
observations are: 

— The Best selective optimization strategy (counting compilation time) is 
within 67% of all-opt-exe for the short-running benchmarks and 14% for 
the longer-running benchmarks. 

— Best substantially outperformed both none-opt and all-opt for the short- 
running benchmark suite, averaging 56% and 71% improvement, respectively. 

— For the longer-running benchmarks, Best shows a much smaller improvement 
over all-opt, averaging 12%. 

— The sweet spot is fairly small for short-running benchmarks, and large for 
longer-running benchmarks. 

— Best from the incomplete, sample-based profile can match Best from the 
time-based profile independent of the application’s running time and sweet 
spot size, even though the sample-based profile contains only a fraction of 
the executed methods. 

— Predicting which methods to optimize is much harder for short-running ap- 
plications than for longer-running applications. Predictions based on the 
sample-based profile are more effective than predictions based on the time- 
based profile. “Samples in method” is the most effective overall predictor, 
averaging 8.5% and 1.03% degradation from Best, respectively, for the short- 
and longer-running applications. 



7.1 Impact on Adaptive Optimization 

The goal of this work is to gain a better understanding of the issues surround- 
ing selective optimization, and in doing so provide insight into the more general 
problem of adaptive optimization. Understanding the performance gains possi- 
ble from selective optimization, as well as the classes of applications for which 
these gains can be expected, is useful for guiding efforts in designing an effective 
adaptive system. 

As expected, selective optimization provides the largest potential benefit for 
the short-running benchmarks. The goal of an adaptive system will be to achieve 
as much of this win as possible. 

The longer-running applications presented in this paper are approaching the 
point where all-opt is only slightly worse than Best, which, in turn, is only 
slightly worse than all-opt-exe. In addition, the large sweet spot of the longer- 
running benchmarks makes it easy to predict a set of methods to optimize that 
will result in reasonable performance. We feel this is encouraging for adaptive 
optimization because it suggests that with little effort an adaptive system can 
achieve close to the performance of all-opt-exe, and by concentrating its efforts on 
feedback-directed optimizations, such as inlining and specialization, can achieve 
performance beyond that of a static compiler. 

The results of the sample-based profile are also encouraging. We now do 
not expect the accuracy of our coarse-grained sampler to be a problem for the 
adaptive system. The sample-based profile was more effective than the time- 
based profile when used to predict which methods to optimize, even for the 
short-running benchmarks, which run for as little as 1.5 seconds. This suggests 
that an adaptive system will be able to use an online, sample-based profile to
make effective optimization decisions, and will be able to trust the profile after
a short amount of time.

One concern is how an adaptive system should select methods for optimiza- 
tion. Even with the offline profile providing full knowledge of the future, the Best 
set of methods to optimize is difficult to predict for the short-running bench- 
marks. In an adaptive system, the problem becomes even more complex because 
less profiling information is available. Yet, unlike in this study, an adaptive opti- 
mization system is not required to make all of its optimization decisions at one 
time, and thus the predictors presented earlier are not the only selection
options available to an adaptive system. For example, an adaptive system could 
begin by optimizing only a small part of the application and continue until it 
is no longer improving performance. We see exploring such possibilities as an 
interesting area of future work.

Abstract. This paper describes our experience with MaJIC, a just-in-time
compiler for MATLAB. In the recent past, several compiler projects
claimed large performance improvements when processing MATLAB code. 
Most of these projects are static compilers suited for batch processing; 
MaJIC is a just-in-time compiler. The compilation process is trans- 
parent to the user. This impacts the modus operandi of the compiler, 
resulting in a few interesting analysis techniques. Our experiments with 
MaJIC indicate large speedups when compared to the interpreter, and 
reasonable performance when compared to static compilers. 



1 Introduction 

MATLAB, a product of Mathworks Inc., is a popular programming language
and development environment for numeric applications. MATLAB is similar to 
FORTRAN 90: it is an imperative language that deals with vectors and matrices. 
Unlike FORTRAN, however, MATLAB is untyped and polymorphic: the seman- 
tic meaning of identifiers is determined by their mode of use, and the operators’ 
meaning depends on the operands. 

The main strengths of MATLAB lie both in its interactive nature, which 
makes it a handy exploration tool, and the richness of its precompiled libraries 
and toolboxes, which can be combined to solve complex problems. The trouble 
with MATLAB is its speed of execution. Unless the bulk of the computation is 
shifted into one of many built-in libraries, MATLAB pays a heavy performance 
penalty: two or even three orders of magnitude compared to similar Fortran 
code. 

The purpose of our work is to eliminate MATLAB’s performance lag while 
maintaining its interactive nature. Hence the requirement for dynamic, or just- 
in-time, compilation, which is used for similar purposes in other programming 
environments, including Smalltalk and Java. MaJIC acts like an interpreter: 
there is no need to deal with Makefiles and a lengthy compilation process, and 
therefore the user’s work flow is not interrupted. 

Just-in-time compilation has been around for a while, but has recently gained
new prominence with the rise of Java. Before Java it was called dynamic
compilation. Deutsch described a dynamic compilation system for the Smalltalk
language as early as 1984. Java benefits from JIT compilation through a reduced
machine instruction count per line of bytecode, optimized register usage,
interprocedural analysis and, in many cases, replacement of virtual function
calls with static calls. Both Java and MATLAB benefit from JIT compilation
because it is possible to ascribe more precise semantic meaning to the symbols
in the program, thereby reducing the potential polymorphism of the code,
although the optimizations turn out to be quite different in practice.

The rest of this paper is organized as follows: we set the stage by describing 
related work, then we give a high-level overview of our compiler. We describe 
some of the performance enhancing features we implemented. Finally we present 
the results we have obtained so far and our conclusions and plans for the future. 



2 Other Work with MATLAB 

The techniques used in MaJIC are based on those developed for DeRose and
Padua’s FALCON compiler, a MATLAB to Fortran 90 translator implemented
at the University of Illinois.

MENHIR [3], developed by Francois Bodin at INRIA, is similar to FALCON:
it generates code for MATLAB and exploits parallelism by using optimized
runtime libraries. MENHIR’s code generation is retargetable (it generates C or
FORTRAN code). It also contains a type inference engine similar to FALCON’s.

MATCOM is a commercial MATLAB-compatible compiler family, developed
by Mathtools Inc. This is a full development environment which incorporates
a debugger, an editor, many optimizations, and even a JIT compiler. It
is only commercially available.

Octave is a freeware MATLAB implementation. Octave may not be very
fast, but it is GNU freeware. 

MATCH is a MATLAB compiler targeted to heterogeneous architectures,
such as DSP chips and FPGAs. It also uses type analysis and generates code for 
multiple large functional units. 

MatMarks and MultiMatlab are projects that are somewhat similar
in that they extend MATLAB to deal with a multiprocessing environment.
MultiMatlab adds MPI primitives to MATLAB; MatMarks layers MATLAB on 
top of the Treadmarks software DSM system. 

Vijay Menon, at Cornell University, is currently developing a MATLAB
vectorizer, a tool that transforms loops into Fortran 90-like MATLAB
expressions, built on top of the MaJIC compiler framework. We are hoping to
integrate his work into the compiler soon.



3 MaJIC Software Architecture 

In this section we give a short overview of MaJIC’s architecture and mode 
of operation. MaJIC’s interaction with the rest of the world is handled by 
an interpreter similar to MATLAB’s. Since MATLAB’s code is proprietary, we
could not extend it with the JIT compiler. Therefore we built our own interpreter,
which is capable of handling a subset of the MATLAB language. The interpreter
scans the input file, parses it into a high-level Abstract Syntax Tree, and then 
executes the program by traversing the tree in program order. It also contains a 
symbol table, an operand stack, and a function finder. 

The analyzer performs three operations: symbol disambiguation, type in- 
ference, and optional loop dependence analysis. Analysis does not change the 
program tree, but rather annotates it. It is designed to be invoked at runtime; 
its input is in effect the current execution state of the interpreter, including the 
current symbol table and the operand stack. 

The code generator generates executable code directly based on the AST and 
its annotations, and does so very fast. The code generator allocates registers, 
emits machine code, resolves addresses and patches forward jumps. 

The code repository maintains order between the several versions of annota- 
tions and compiled code co-existing at a given time. Since a given piece of MAT- 
LAB code might be invoked several times, and since analysis depends heavily 
on the state of the interpreter at the time of invocation, there could be several 
sets of type annotations for the same code at the same time. 

4 MaJIC Compilation Techniques 

In this section we describe some of the techniques we used to speed up MATLAB 
execution. 



4.1 Symbol Disambiguation 

MATLAB symbols exclusive of keywords represent one of three things: variables, 
calls to built-in primitives, or calls to user functions. In the interpreter the de- 
cision is made at runtime when the symbol’s AST node is executed. We call a 
symbol occurrence ambiguous if the way the program is written allows multiple 
meanings for it. The code box below shows a loop where the first occurrence of
the symbol i is ambiguous, interpreted as √−1 in the first iteration and as a
variable afterwards.



while(...),
    z = i;
    i = 1;
end



In most MATLAB programs, however, symbol ambiguity can be removed at 
compile time. In particular, symbols that represent variables can be conserva- 
tively identified by a variation of reaching definitions analysis: A symbol that 
has a reaching definition on all paths leading to it must be a variable.

The transfer function in this analysis can be set up as usual, but with a slight
twist. It is defined on the set of the symbols in the program S and the nodes of
the CFG N as follows:

Equation 1 says, in effect, that a symbol generates a reaching definition if it
appears on the left-hand side. The unusual part of this equation is its third line,
which says that any symbol that cannot be proven to be a variable has the side
effect of killing all reaching definitions. This is necessary because of the existence
of built-in functions like clear, which destroys the whole symbol table.
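
The disambiguation rule described above can be illustrated with a small Python sketch. It is not MaJIC's implementation: the CFG encoding, the function name disambiguate, and the initial_vars parameter are assumptions made for this example, and the transfer function only follows the prose description (a left-hand side generates a definition; a use that cannot be proven to be a variable kills all definitions).

```python
def disambiguate(nodes, preds, entry, initial_vars=frozenset()):
    """Return, per CFG node, the used symbols proven to be variables.

    nodes: dict node_id -> (defined_symbols, used_symbols)
    preds: dict node_id -> list of predecessor node ids
    initial_vars: symbols already defined on entry (hypothetical input)
    """
    universe = set(initial_vars)
    for defs, uses in nodes.values():
        universe |= set(defs) | set(uses)
    # Optimistic start: a "must" analysis shrinks the sets until they stabilise.
    out = {n: set(universe) for n in nodes}

    def state_on_entry(n):
        incoming = [out[p] for p in preds.get(n, [])]
        if n == entry:
            incoming.append(set(initial_vars))
        # A symbol is a variable only if it is defined on ALL incoming paths.
        return set.intersection(*incoming) if incoming else set()

    changed = True
    while changed:
        changed = False
        for n, (defs, uses) in nodes.items():
            state = state_on_entry(n)
            if any(u not in state for u in uses):
                # The symbol may be a call (e.g. clear): kill every definition.
                state = set()
            state |= set(defs)
            if state != out[n]:
                out[n], changed = state, True
    return {n: {u for u in uses if u in state_on_entry(n)}
            for n, (defs, uses) in nodes.items()}


# The loop from the code box: node 0 is the loop header,
# node 1 is "z = i", node 2 is "i = 1" with a back edge to node 0.
LOOP = {0: ((), ()), 1: (("z",), ("i",)), 2: (("i",), ())}
LOOP_PREDS = {0: [2], 1: [0], 2: [1]}
```

With no prior knowledge, disambiguate(LOOP, LOOP_PREDS, 0) leaves the use of i in node 1 unproven; passing initial_vars={"i"} proves it to be a variable.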



Use-def Chain Generation: the UD chain is obtained as a side-effect of the 
disambiguation step. It uses the results of the same reaching definitions analysis 
but determines the set of definitions for each variable by calculating the set union 
of its reaching definitions. 



4.2 Type Inference 

Our representation of a MATLAB expression type lies at the base of the inference 
system. This representation is very close to the one described in 0. We use the 
following type descriptors: 

— The intrinsic type of the expression, such as: real, integer, boolean, com- 
plex or string. 

— The shape of the expression, i.e. the maximum number of rows and columns 
it can have. 

— The range of real values the expression can take. We do not define the
range for strings and for complex expressions. 

The type analyzer is a monotonic data analysis framework, which requires
the analyzed property to form a finite semi-lattice and all transition functions 
to be monotonic. 

Our lattice is the Cartesian product $L_i \times L_s \times L_l$ of the three component
lattices, which are defined as follows:

$L_i = \{I, \bot_i, \top_i, <_i, \vee_i\}$, where
$I = \{\bot_i, \mathrm{bool}, \mathrm{int}, \mathrm{real}, \mathrm{cplx}, \mathrm{strg}, \top_i\}$,
$\bot_i <_i \mathrm{bool} <_i \mathrm{int} <_i \mathrm{real} <_i \mathrm{cplx} <_i \top_i$ and $\bot_i <_i \mathrm{strg} <_i \top_i$

$L_s = \{\mathbb{N} \times \mathbb{N}, \bot_s, \top_s, <_s, \vee_s\}$, where $\mathbb{N}$ is the set of natural numbers, and
$\bot_s = \{0, 0\}$, $\top_s = \{\infty, \infty\}$; $\{a, b\} <_s \{c, d\}$ iff $a < c$ and $b < d$

$L_l = \{\mathbb{R} \times \mathbb{R}, \bot_l, \top_l, <_l, \vee_l\}$, where $\mathbb{R}$ is the set of real numbers, and
$\bot_l = \{\mathrm{nan}, \mathrm{nan}\}$; $\top_l = \{-\infty, \infty\}$;
$\{a, b\} <_l \{c, d\}$ iff $\{a, b\} = \bot_l$ or ($c < a$ and $b < d$)
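
To make the three component lattices concrete, the following Python sketch implements a join (least upper bound) for each of them and for their product. The tuple encoding (intrinsic, shape, range) and all function names are assumptions made for this illustration; the paper does not show how the type calculator represents types.

```python
import math

# Chain part of the intrinsic lattice; "strg" sits beside it, below "top".
CHAIN = ["bottom", "bool", "int", "real", "cplx", "top"]


def join_intrinsic(a, b):
    if a == b:
        return a
    if a == "bottom":
        return b
    if b == "bottom":
        return a
    if "strg" in (a, b):
        return "top"  # strg is comparable only with bottom and top
    return max(a, b, key=CHAIN.index)


def join_shape(a, b):
    # Shapes are upper bounds on (rows, columns).
    return (max(a[0], b[0]), max(a[1], b[1]))


def join_range(a, b):
    # (nan, nan) is the bottom element of the range lattice.
    if math.isnan(a[0]):
        return b
    if math.isnan(b[0]):
        return a
    return (min(a[0], b[0]), max(a[1], b[1]))


def join(t1, t2):
    """Join two (intrinsic, shape, range) descriptors componentwise."""
    return (join_intrinsic(t1[0], t2[0]),
            join_shape(t1[1], t2[1]),
            join_range(t1[2], t2[2]))
```

For example, joining an int column vector of at most 1 x 1 with range (0, 5) and a real value of at most 3 x 1 with range (-1, 2) yields a real of at most 3 x 1 with range (-1, 5).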



A type calculator implements the transition functions for all the known AST 
types. A fairly large number of rules are required due to the many built-in 
MATLAB functions. 



Convergence Issues. The monotonic data analysis framework is guaranteed 
to converge if the lattices involved are finite. Our lattices are finite (due to the 
fact that real numbers are represented in finite precision), but very large and, 
therefore, analysis can take a very long time. A simple example is presented
below. The range of x keeps expanding in steps of 1.0, and it takes an impractically
large number of iterations to reach the stationary value of +∞.

while(...),
    x = x + 1;
end

A similarly simple example can be made up with a shape that grows very 
slowly. 

while(...),
    x = [x x];
end

“Runaway” types keep changing as the iterations progress, and hence are 
easily detectable. In these cases MaJIC sets the types in question to ⊤ to force
convergence. 
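
The handling of runaway types can be sketched in Python as follows. This is an illustration only: the pass limit max_passes and the function names are assumptions, since the paper does not state how many iterations MaJIC allows before forcing a type to ⊤.

```python
import math

TOP_RANGE = (-math.inf, math.inf)


def join_range(a, b):
    return (min(a[0], b[0]), max(a[1], b[1]))


def loop_range(entry_range, body, max_passes=4):
    """Iterate the range at a loop head towards a fixed point.

    body maps the range at the loop head to the range after one iteration.
    A range that is still changing after max_passes is treated as runaway
    and forced to the top element.
    """
    current = entry_range
    for _ in range(max_passes):
        updated = join_range(current, body(current))
        if updated == current:
            return current  # converged
        current = updated
    return TOP_RANGE


def increment(r):
    # Abstract effect of "x = x + 1" on the range of x.
    return (r[0] + 1.0, r[1] + 1.0)


def clamp(r):
    # Abstract effect of a body that keeps x within [0, 10].
    return (max(r[0], 0.0), min(r[1], 10.0))
```

Here loop_range((0.0, 0.0), increment) gives up and returns the top range, while loop_range((0.0, 5.0), clamp) converges immediately.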



4.3 Just-in-Time Analysis 

In MaJIC analysis is performed very late, just before the code is executed. This 
late timing benefits the analyzer by contributing items of information to which 
a static analyzer doesn’t have access. 

— We can assume that the files in the MATLAB search path, and hence the
names of the defined user functions, are not going to change between
compilation and execution. This allows for greater precision in the symbol
disambiguation step. There is an underlying assumption that the MATLAB code
is not self-modifying.

— The analyzer has access to the entire interpreter state at the point where
the compiler is invoked, including knowledge of the exact types and values
of all defined variables. This translates into knowledge of all defined types at
the point of entry into the compiled code. The extra information reduces the
requirement for heavy analysis because very simple techniques yield good
results. For example, there is no symbolic analysis in MaJIC.

Constant propagation is achieved as a side-effect of range propagation. A 
value is constant if its range contains a single number. Because the input infor- 
mation is so precise, many scalar values that would be symbolic parameters in 
static analysis end up as constants. 

Exact shapes are inferred in MaJIC by adding another lattice to the type 
inference engine. Normally, shape inference delivers only upper bounds on the 
shape of MATLAB arrays; we append a dual lattice Lt to calculate lower bounds 
too. An exact shape results where the lower and upper bounds are determined 
to be equal. 

Induction variables’ ranges can also be used to infer exact shapes. Given 
an array index expression like A(x), the range of x is used to determine the 
shape of A. Normally, this is an upper bound since there is no guarantee that 
x actually reaches the limits of its range. An array index can only be used to
determine a lower bound if it assumes its range limits at least once. In a loop 
without premature exit instructions, the induction variable is guaranteed to hit 
its range limits. This “boundary-hitting” property can be propagated through 
index expressions that combine induction variables and constants to determine 
lower bounds for the shapes of arrays. 

There are a number of ways in which exact shapes can be used to achieve 
better performance. For example, array bound checks can be removed wherever 
the array index expression’s upper range limit is less than the lower bound of the 
corresponding array shape. On the right-hand side, this removes the requirement 
for mandatory error testing; on the left-hand side, it removes the auto-resizing 
check. 
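
The bound-check condition can be written down as a small Python predicate. It is a sketch under stated assumptions: the function name is hypothetical, indices are 1-based as in MATLAB, the array is treated as one-dimensional, and the test uses "not greater than" the guaranteed minimum size, which is the safe condition for a 1-based index.

```python
def needs_bounds_check(index_range, min_elements):
    """Decide whether an access A(x) still needs a runtime subscript check.

    index_range: (low, high) range inferred for the index x
    min_elements: lower bound on the number of elements of A
    """
    low, high = index_range
    provably_safe = low >= 1 and high <= min_elements
    return not provably_safe
```

For an array known to hold at least 10 elements, an index with range (1, 10) needs no check, whereas (1, 11) or (0, 5) does.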

MATLAB’s own companion compiler, mcc, has command-line switches to dis- 
able subscript checks, but using this option does not guarantee that the compiled 
program will work correctly. MaJIC removes subscript checks conservatively. 
The extra effort for just-in-time analysis is very low, comparing favorably with
more conventional techniques.



4.4 Code Generation 

The design of the code generator is based on the trade-off between compilation 
speed and the quality of generated code, as well as a desire to be able to plug 
in several native code generators. We chose a two-layered approach: a high- 
level code selector traverses the AST and chooses the appropriate native code 
generating routines, which then emit native instructions.

The Code Selector traverses the AST and reads the annotations associated
with each node. It performs top-down pattern matching to select the appropriate
implementation for any given AST node. For example, for a multiplication
node there are a number of possible implementations, including scalar
multiplication, dot product, vector-matrix multiplication and complex vector
multiplication, to mention just a few.
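
As an illustration of annotation-driven selection, the Python sketch below picks an implementation for a multiplication node from the operand annotations. The routine names it returns and the (intrinsic, shape) encoding are hypothetical, and shapes are assumed to be exact; MaJIC's actual pattern tables are not shown in the paper.

```python
def select_multiply(lhs, rhs):
    """Choose a multiplication routine from (intrinsic, (rows, cols)) annotations."""
    (ltype, (lrows, lcols)), (rtype, (rrows, rcols)) = lhs, rhs
    prefix = "complex_" if "cplx" in (ltype, rtype) else ""
    lscalar = (lrows, lcols) == (1, 1)
    rscalar = (rrows, rcols) == (1, 1)
    if lscalar and rscalar:
        return prefix + "scalar_mul"
    if lscalar or rscalar:
        return prefix + "scale"            # scalar times array
    if lcols != rrows:
        raise ValueError("inner dimensions do not agree")
    if lrows == 1 and rcols == 1:
        return prefix + "dot_product"      # row vector times column vector
    if lrows == 1:
        return prefix + "vector_matrix_mul"
    if rcols == 1:
        return prefix + "matrix_vector_mul"
    return prefix + "matrix_mul"
```

A real 1 x 3 operand times a real 3 x 1 operand selects the dot product; if either operand is complex, the complex variant of the same routine is chosen.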

Here is a short summary of the non-trivial techniques used by the code se- 
lector. 

— Preallocation of temporary arrays, when their maximal sizes are known at
compile time. Because of the Fortran 90-like expression semantics,
complicated MATLAB vector expressions involve temporary matrices at each step,
and preallocating these eliminates many calls to malloc(). For example, in
an expression like a × b × c, at least one temporary matrix is needed to hold
the result of a × b. If upper bounds are known for the shapes of a and b, this
temporary can be pre-allocated.

— Elimination of temporaries. Techniques have been developed to deal with
this in Fortran 90-like languages. MaJIC takes a simple way out and
performs code selection to combine several AST nodes into a single library
call. For example, expressions like a*x+b'*c may be transformed into a single
call to dgemv.

— When indexed with out-of-bound subscripts, MATLAB arrays attempt to 
resize themselves to include the offending subscript. Thus, in some loops, 
arrays are resized in each iteration with the associated performance penalty. 
We implemented a simple, but effective, strategy of allocating about 10% 
more memory than needed whenever an array is resized dynamically. Since 
dynamic resize tends to happen in loops, the surplus of memory will likely be 
used up later, meanwhile reducing the number of calls to malloc(). Speedups
from this technique can reach two orders of magnitude or more, depending 
on the characteristics of the loop. 

— Complete unrolling of vector operations when exact array shapes are known. 
This technique is most effective on small (2 × 1 to 3 × 3) matrices and
vectors because it eliminates all loop control instructions. In the future, we 
are planning to expand this technique to completely unroll small MATLAB 
loops with exact loop bounds. 
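
The over-allocation strategy from the list above can be illustrated with a Python sketch that counts reallocations instead of measuring time. The class GrowableVector, its slack parameter, and the helper fill are hypothetical names for this example; only the idea of allocating about 10% extra on each dynamic resize comes from the text.

```python
import math


class GrowableVector:
    """1-based vector that resizes itself on out-of-bounds stores."""

    def __init__(self, slack=0.10):
        self.slack = slack          # fraction of extra capacity per resize
        self.data = []              # backing storage (the capacity)
        self.length = 0             # logical length seen by the program
        self.reallocations = 0

    def store(self, index, value):
        if index > len(self.data):
            # Resize to the requested size plus about 10% surplus.
            capacity = index + math.ceil(index * self.slack)
            self.data.extend([0.0] * (capacity - len(self.data)))
            self.reallocations += 1
        self.data[index - 1] = value
        self.length = max(self.length, index)


def fill(n, slack):
    """Grow a vector one element at a time, as a MATLAB loop might."""
    v = GrowableVector(slack)
    for k in range(1, n + 1):
        v.store(k, float(k))
    return v
```

Filling 1000 elements with slack 0.0 triggers one reallocation per store, while slack 0.10 needs only a few dozen; the actual speedup depends on the cost of each reallocation, which this sketch does not model.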

The Native Code Generator emits code into a string buffer, and then the 
buffer is simply executed. Native code generation is also a multi-pass process. 

Unfortunately there is no really general-purpose dynamic code generator that
suits our purposes. The closest contender is Dawson Engler’s vcode, a dynamic
assembler with its own RISC-like instruction set. vcode has been ported
to many popular CPU architectures, including SPARC and MIPS, but the x86 port
is not usable for our purposes because it does not include floating-point
instructions.

tcc is a C-like language layered on top of vcode. It adds a lot of syntactic
sugar and low-level optimizations like peephole optimization and register
allocation, all performed at runtime. We used tcc for our implementation.

The second choice for the MaJIC backend is Paolo Bonzini’s GNU Lightning,
a dynamic assembler with a platform-independent instruction set. It
is ported to the SPARC, PPC and IA32 architectures. GNU Lightning is based
on Ian Piumarta’s ccg dynamic assembler. The problem with GNU Lightning is
that in striving to be architecture independent it restricts the user to 6 integer
registers and a stack-based floating point system. These restrictions degrade
performance on RISC architectures.



4.5 Repository Management 

Since MATLAB is inherently polymorphic, the same source code can have several 
analyses attached to it, depending on how it is called. The repository keeps these 
analyses in order and tries to match existing compiled code to any call that is 
made. The concept of a match is defined very simply, based on the input data 
used by the analyzer: a given interpreter state matches an analysis stored in the 
repository if all the affected symbols have a “smaller” (<) type than the ones 
used by the analysis. For example a function compiled for complex arguments 
can be reused when invoked with real actual arguments. 
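
The matching rule can be sketched in Python as a componentwise comparison of type descriptors. The encoding (intrinsic, shape, range), the repository layout as a list of (signature, code) pairs, and the function names are assumptions for illustration; string types and the bottom elements are omitted for brevity.

```python
import math

ORDER = {"bool": 0, "int": 1, "real": 2, "cplx": 3}


def intrinsic_le(a, b):
    if a == b or b == "top":
        return True
    return a in ORDER and b in ORDER and ORDER[a] <= ORDER[b]


def type_le(actual, compiled):
    """True if the actual type is no larger than the compiled-for type."""
    (ai, ashape, arange), (ci, cshape, crange) = actual, compiled
    return (intrinsic_le(ai, ci)
            and ashape[0] <= cshape[0] and ashape[1] <= cshape[1]
            and crange[0] <= arange[0] and arange[1] <= crange[1])


def find_match(repository, state):
    """Return the first compiled version whose signature covers the state."""
    for signature, code in repository:
        if signature.keys() == state.keys() and all(
                type_le(state[name], signature[name]) for name in signature):
            return code
    return None


ANY = (-math.inf, math.inf)
REPOSITORY = [({"x": ("cplx", (math.inf, math.inf), ANY)}, "f_for_complex")]
```

With this repository, a call whose argument x is a real 3 x 3 matrix with range (0, 1) reuses the version compiled for complex arguments, while a repository holding only a real version would not match a complex call.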

If a function is invoked many times with incompatible arguments, each
invocation might require a new analysis, thereby diminishing performance. To
increase the chance of future matches, MaJIC’s analyzer has the option of
generalizing the input types before starting processing. This, of course, also
results in a performance penalty for the compiled code.

MaJIC uses a gradual strategy for changing types. The first analysis is per- 
formed with the types as they are; the second analysis discards range information 
and exact shapes. If a third call is performed and none of the performed anal- 
yses match, all inputs are treated as matrices; finally, if yet another analysis is 
needed, it is performed with no information whatsoever from the input. 
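
One possible reading of this gradual strategy is sketched below in Python. The ArgType fields and the exact effect of each step are assumptions: in particular, "treated as matrices" is interpreted here as dropping the upper bound on the shape, which the paper does not spell out.

```python
import math
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ArgType:
    intrinsic: str
    min_shape: tuple      # lower bound on (rows, cols)
    max_shape: tuple      # upper bound on (rows, cols)
    value_range: tuple    # (low, high)


def generalize(t, attempt):
    """Widen an input type according to how many analyses already failed to match."""
    if attempt >= 2:
        # Second analysis: discard range information and exact shapes.
        t = replace(t, min_shape=(0, 0), value_range=(-math.inf, math.inf))
    if attempt >= 3:
        # Third analysis: treat every input as a matrix of unknown size.
        t = replace(t, max_shape=(math.inf, math.inf))
    if attempt >= 4:
        # Final analysis: no information from the input at all.
        t = replace(t, intrinsic="top")
    return t
```

For a scalar-valued 3 x 3 integer input, generalize(t, 1) returns the type unchanged, and each later attempt gives up one more piece of information.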

In real life, most MATLAB invocations are monomorphic; none of our bench- 
mark functions need more than two analyses. 



5 Experimental Results 

To gauge the performance of the MaJIC compiler, we measured both the speed 
of the generated code and the time spent compiling code. Our frame of reference 
for execution speed is set by the MATLAB interpreter. We measured the perfor- 
mance of three compilers: MATLAB’s own compiler mcc, the FALCON compiler, 
and MaJIC. 

Surprisingly, for FALCON we obtained speedup figures that were better than 
those reported in |S| . We attribute this to the better Fortran 90 compiler we used 
as well as to a CPU with a large amount of L2 cache that rewards the inherently
better locality of compiled code.

5.1 The Benchmarks

We ran our experiments on the benchmarks originally gathered by Luiz DeRose. 
These are described in p], so we will limit ourselves to a short description. The 
12 benchmarks are all numerically intensive and relatively short: between 25 and 
125 lines of code. They are as follows: 

— cgopt, icn, qmr, and sor are linear equation solvers. 

— crnich, dirich, finedif, galrkn, orbec, and orbrk are PDE solvers: finite
difference solvers, a FEM solver, and n-body solvers. 

— adapt is an adaptive quadrature algorithm. 

— mei is a fractal generator. 

The matrices in the benchmarks have sizes between 50 x 50 and 450 x 450. 
The problem sizes were originally determined to yield approximately 1 minute 
run times, but newer computers tend to finish them faster. We ran all our
experiments on a Sun workstation with Solaris 2.7 and a 270 MHz UltraSPARC processor.

5.2 Performance Figures 

Figure 1 shows the speedups of mcc, MaJIC, and FALCON on a logarithmic
scale. These speedups are based on execution times only (i.e., they don’t take 
into account the compilation times). Missing bars mean speedups are below 1.0. 

Execution times tell only half the story, however. Figure 2 shows the time
spent from one user prompt to the next in each benchmark (i.e. the time spent 
by the user waiting for the result). For MaJIC and mcc, we assume that the 
compilation time is included in the prompt-to-prompt time; mcc tends to be used 
a lot from the MATLAB prompt, and MaJIC incurs the compilation penalty 
by default. We show two sets of speedups for FALCON, with and without com- 
pilation time included. 

As the figures show, MaJIC’s performance falls between mcc and FAL- 
CON when judged on execution time alone. However, MaJIC takes much less 
time to compile than mcc, making it a better choice when the code is edited - 
and thus recompiled - frequently. 

We can classify the benchmarks based on which compilation techniques in- 
fluence their performance the most. 

— The adapt benchmark benefits from the dynamic over-allocation scheme
mentioned in Section 4.4.

— The cgopt and qmr benchmarks benefit from a better implementation of the 
dgemv routine, as they spend more than 99% of their time doing matrix-vector
multiplication. qmr derives additional benefits from an exact match
between the MATLAB code and the API of the dgemv function. These bench- 
marks do not benefit much from type analysis; we get similar performance 
numbers with type analysis turned off. 

— The crnich and orbrk functions suffer because MaJIC does not currently 
implement function inlining.

Fig. 1. Speedups based on execution time only



— crnich, dirich, finedif, galrkn, and orbec suffer from the lack of low-level
compiler optimizations in the MaJIC code generator (e.g., common
subexpression elimination and strength reduction). We recompiled the FALCON
codes with the Fortran compiler optimizations turned off (f90 -g).
Figure 3 compares the unoptimized FALCON codes with MaJIC speedups.
The MaJIC numbers include compile time. The performance numbers are
pretty close to each other. Just-in-time analysis and high-level code generation
techniques in MaJIC compensate for the time lost while compiling.



We measured the effectiveness of just-in-time analysis by running the analyzer 
on a subset of our benchmarks, in just-in-time mode and in simulated static 
mode. In just-in-time mode, the analyzer had access to the interpreter state just
before starting compilation; in static mode, the analyzer was cut off from the 
interpreter and ran with no outside information. Of 500 AST expression nodes, 
the just-in-time analyzer determined 402 to be scalars; by comparison, static 
analysis determined only 240 of them to be scalar. 

Just-in-time analysis is reasonably fast. Analysis involves 2 to 20 traversals of 
the abstract syntax tree. The 270MHz Sparc processor that ran the benchmarks 
took about 0.2 milliseconds per iteration to process an AST node; average com- 
pilation time (including disambiguation, type inference, and code generation) is 
about 170 milliseconds.

Fig. 2. Speedups based on execution + compile time



Figure 2] shows the relative composition of execution time in MaJIC. Dis- 
ambiguation and type inference never take more than 15% of the total time. 
Actual execution of the code dominates the time spent between user prompts. 
Code generation can take up to 40% of the time in the case of benchmarks that 
have large speedups. 

6 Conclusion and Future Plans 

We have built and tested a just-in-time compiler for MATLAB. It provides ade- 
quate performance while reducing the compilation overhead by at least an order 
of magnitude. 

There is still a lot of room for improvement. The quality of the compiled code
is mediocre at this point, due to the fact that the dynamic compiler/assembler
combinations we use for code generation are not doing any low-level
optimizations. Experiments show that by performing common subexpression
elimination on the dynamically generated code we could improve performance by
at least another 50% on some of the benchmarks; we are planning to implement
these next.

Another technique we are planning to implement is to overlap compilation
with computation. One of the ways to do this is to perform hybrid compilation,
i.e., to analyze and compile as much of the code as possible in advance, as soon
as it is available, and to generate semi-compiled templates that can be quickly
refined by the JIT compiler.

MaJIC: A Matlab Just-In-time Compiler

Fig. 3. Speedups without F90 optimizations

We also have other techniques under consideration. Some of the MATLAB 
libraries can be replaced by better ones; some of the loops can be executed
in parallel on SMP machines. FALCON performance figures show that function 
inlining is a very cost-effective technique.

Abstract. State-of-the-art run-time systems are a poor match to diverse, dynamic
distributed applications because they are designed to provide support to a wide 
variety of applications, without much customization to individual specific re- 
quirements. Little or no guiding information flows directly from the application to 
the run-time system to allow the latter to fully tailor its services to the application. 
As a result, the performance is disappointing. To address this problem, we pro- 
pose application-centric computing, or SMART APPLICATIONS. In the executable 
of smart applications, the compiler embeds most run-time system services, and 
a performance-optimizing feedback loop that monitors the application’s perfor- 
mance and adaptively reconfigures the application and the OS/hardware platform. 
At run-time, after incorporating the code’s input and the system’s resources and 
state, the SmartApp performs a global optimization. This optimization is instance 
specific and thus much more tractable than a global generic optimization between 
application, OS and hardware. The resulting code and resource customization 
should lead to major speedups. In this paper, we first describe the overall
architecture of SmartApps and then present the achievements to date: run-time
optimizations, performance modeling, and moderately reconfigurable hardware.



1 Introduction 

Many important applications are becoming large consumers of computing power, data 
storage and communication bandwidth. For example, applications such as ASCI multi- 
physics simulations, real-time target acquisition systems, multimedia stream processing 
and geographical information systems (GIS), all put tremendous strains on the compu- 
tational, storage and communication capabilities of the most modern machines. There 
are several reasons why the performance of current distributed, heterogeneous systems 
is often disappointing. First, they are difficult to fully utilize because of the hetero- 
geneity of the processing nodes which are interconnected through a non-homogeneous 
network with different inter-node latencies and bandwidths. Secondly, the system may 
change dynamically while the application is running. For example, nodes may fail or 
appear, network links may be severed, and other links may be established with different 
latencies and bandwidths. Finally, in order to obtain decent performance, the work has 
to be partitioned in a balanced manner. 

CCR-9624315, NSF Grant ACI-9872126, NSF-NGS EIA-9975018, DOE ASCI ASAP Level 

Current distributed systems have a fairly compartmentalized approach to optimization: 
applications, compilers, operating systems and even hardware configurations are 
designed and optimized in isolation and without knowledge of the input data. There is 
too little information flow across these boundaries and no global optimization is even 
attempted. For example, many important activities managed by the operating system, such as 
paging, virtual-to-physical page mapping, I/O, or data layout on disks, 
are provided with little or no application customization. Since the compiler’s analysis 
can discover much about an application’s needs, performance could be boosted 
significantly if the OS provided hooks for the compiler, and possibly the user, to customize 
or tailor OS activities to the needs of a particular application. Current hardware is built 
for general purpose use to lower costs and has almost no tunable parameters that allow 
the compiler or the OS to adjust it to specific application characteristics. 

In addition to this lack of compiler/OS/hardware cooperation, a second important 
problem is that compilers do not necessarily know fully at compile time how an 
application will behave at run time because the run-time behavior of an application may partly 
depend on its input data. Consequently, compilers may generate conservative code that 
does not take advantage of characteristics of the program’s input data. This precludes 
many aggressive optimizations related to code parallelization and redundancy 
elimination. Moreover, we can only use expensive, generic methods for load balancing and 
memory latency hiding. If, instead, the compiler inserted code that, after reading the 
input data to the program at run-time, adaptively made optimization decisions, 
performance could be boosted significantly. Furthermore, at a higher level, the compiler may 
have the possibility of selecting an algorithm from a library of functionally equivalent 
modules. If this choice is made based on the specific instance of an application then 
large-scale gains can be obtained. For example, if the code calls for a sorting routine, 
the compiler can specialize this call to a specific parallel sort that matches both the input 
data to be sorted and the architecture on which it will be executed. 

Our ultimate goal is the overall minimization of execution time of dynamic 
applications in parallel systems. Instead of building individual, generally optimized 
components (compilers, run-time systems, operating systems, hardware) that can work 
acceptably well with any application, we will subordinate the whole optimization process to 
the particular needs of a specific application. We will drive the optimization with the 
requirements of an individual program and for a specific set of input data. Moreover, 
the optimization will be carried out continuously to adapt to the dynamic, time varying 
needs of the application. The final form of the executable of an application will take 
shape only at run-time, after all input data has been analyzed. The resulting smart 
application (SmartApp) will monitor its performance and, when necessary, restructure 
itself and the underlying OS and hardware to its new characteristics. While this method 
may cost some additional overhead for every execution, the resulting customized 
performance can more than pay off for long running codes. 




Fig. 1. Smart Application. 




Fig. 2. ToolBox. 



2 System Architecture 

We now give a general overview of our system which includes components at various 
levels of development. Some features of SmartApps have been implemented, others 
have been studied but have not yet been prototyped, while others are still in early stages. 
We give this high level architectural description that includes both accomplishments 
as well as work in progress in order to put our work in perspective. In the following 
sections we discuss in more detail those components that are in a more advanced state. 

The adaptive run-time system, shown in Figures 1 and 2, consists of a nested 
multi-level adaptive feedback loop that monitors the application’s performance and, based 
on the magnitude of deviation from expected performance, compensates with various 
actions. Such actions may be run-time software adaptation, re-compilation, or operating 
system and hardware reconfiguration. The system shown in Figure 1 uses techniques 
from a ToolBox shown in Figure 2. The ToolBox contains application- and 
system-specific databases and algorithms for performance evaluation, prediction and system 
reconfiguration. The tools are supported by architectural and performance models. 
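
As an illustration of the outer feedback loop, the following minimal sketch maps the deviation from expected performance to one of the actions listed above. The threshold values, the ordering of the actions by cost, and the names `Action`, `THRESHOLDS` and `choose_action` are assumptions made for this example; the text does not specify them.

```python
from enum import Enum


class Action(Enum):
    NONE = "no action"
    ADAPT = "run-time software adaptation"
    RECOMPILE = "re-compilation"
    RECONFIGURE = "OS and hardware reconfiguration"


# Hypothetical limits on the relative slowdown; the paper gives no values.
THRESHOLDS = [(0.10, Action.NONE), (0.50, Action.ADAPT), (2.00, Action.RECOMPILE)]


def choose_action(expected_time, measured_time):
    """Pick a corrective action from the magnitude of the performance deviation."""
    deviation = max(0.0, (measured_time - expected_time) / expected_time)
    for limit, action in THRESHOLDS:
        if deviation <= limit:
            return action
    # Largest deviations trigger the most intrusive action (an assumption).
    return Action.RECONFIGURE
```

A small deviation leaves the application untouched, while progressively larger deviations escalate to more expensive actions, which is one plausible reading of the nested loop.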

The first stage of preparing a dynamic application for execution occurs outside the 
proposed run-time system. It is a pre-compilation in which all possible static compiler 
optimizations are applied. However, for many of the more aggressive and effective code 
transformations, the needed information is not statically available. For example, if the 
code solves a sparse adaptive mesh-refinement problem, the initial mesh is read from an 
input file only at the beginning of the execution and is therefore not available for static 
compilation. In this case, the compiler may use speculative transformations which will 
be validated at run-time. We will generate an intermediate code that will contain all the 
necessary compiler-internal information statically available, which will be combined 
with execution-time information to finish possible optimizations. Calls to generic 
algorithms or, when possible, parallel algorithm recognition and substitutions will be either 
left in their most general form or specialized to the extent permitted by static compiler 
analysis, e.g., type analysis [?]. 

The second stage in an application’s life is driven by the run-time system. It starts 
by reading in and/or sampling the input data which are relevant to the ’unfinished’ 
optimizations. This ’relevant’ data is analyzed with fast, approximate methods and 
essential characteristics are extracted. The result of this analysis will place the instance 
of this application in a certain ’functioning domain’ which represents the possible uni- 
verse of forms that an application can take at run-time. Calls to routines that perform 
certain standard functions will be specialized by selecting from a linked library the al- 
gorithms and/or their implementations that match the ’functioning domain’ (code and 
data) of this particular instantiation of the program. 

Then, a fast RUN-TIME COMPILER, which will be developed from an existing re- 
structurer, will finish the compilation process and generate a highly optimized and 
adaptable code, the Smart Application. This executable will include code for adap- 
tive run-time techniques that allow the application to make on-the-fly decisions about 
various optimizations. To this end, we will use our techniques for detecting and 
exploiting loop level parallelism in various cases encountered in irregular applications 
[?]. Load balancing will be achieved through feedback guided blocked 
scheduling [?], which allows highly imbalanced loops to be block scheduled by predicting a 
good work distribution from previous measured execution times of iteration blocks. 
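
The idea of predicting a work distribution from measured block times can be sketched as follows. This is a simplified illustration, not the published scheduling algorithm: it assumes that the cost per iteration is uniform inside each previously measured block, and the function name `rebalance` is hypothetical.

```python
def rebalance(bounds, times):
    """Return new block boundaries that equalize the predicted time per processor.

    bounds: p + 1 increasing iteration indices delimiting the previous blocks.
    times:  measured execution time of each of the p previous blocks.
    Assumption: cost per iteration is uniform inside each old block.
    """
    p = len(times)
    # Cumulative measured time at each old boundary.
    cum = [0.0]
    for t in times:
        cum.append(cum[-1] + t)
    total = cum[-1]
    if total == 0:
        return list(bounds)
    new_bounds = [bounds[0]]
    k = 0  # old block that contains the current goal
    for j in range(1, p):
        goal = total * j / p  # predicted time that should precede boundary j
        while cum[k + 1] < goal:
            k += 1
        width = bounds[k + 1] - bounds[k]
        offset = (goal - cum[k]) * width / (cum[k + 1] - cum[k])
        new_bounds.append(bounds[k] + round(offset))
    new_bounds.append(bounds[-1])
    return new_bounds
```

For example, if three blocks of 30 iterations took 3, 1 and 2 time units, the sketch shrinks the first block to 20 iterations and enlarges the second to 40, so that each processor is predicted to need 2 time units in the next invocation of the loop.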

For certain simple algorithms, which can be automatically recognized, e.g., reduc- 
tions, the compiler will insert code that can substitute the sequential version with a 
parallel equivalent that best matches the data access pattern of the application. This 
adaptive parallel algorithm substitution technique can be implemented either through 
multi-version code (library calls) as is currently done, or through recompilation. 

The result of static and dynamic compiler analysis of the application will also en- 
able the program to call upon a tunable, modular OS to change some of its parameters 
(e.g., page mapping) and to perform some simple modification of the underlying archi- 
tecture (e.g., type and/or number of system components). During this code generation 
phase, the compiler will generate (statically or at run-time) a list of specifications for the 
run-time environment. These application-level specifications are passed to the system 
configuration optimizer. The PREDICTOR and OPTIMIZER tools will use the applica- 
tion requirements and characteristics to compute an ‘optimal’ architectural configura- 
tion and tune the environment accordingly. In addition to the OS tuning we can perform 
architectural modifications when feasible. As we show in Section 5, we have simulated 
the possibility of customizing communication protocols (e.g., specialized cache 
coherence protocols). In the future we hope to be able to specialize processors for computing or 
communication and distribute the workload between ’classical’ processors and 
processors in memory (IRAM). 

In the next sections we elaborate on some of the currently implemented components 
of the presented SmartApps architecture. 



3 Compiler Generated Run-Time Optimizations 

The Smart Application consists mainly of a run-time library that the compiler embeds 
in the application and that can dynamically select compiler optimizations (e.g., 
loop parallelization or scheduling for load balance). Some non-intrusive architectural 
reconfiguration and operating system level tuning may also be employed to obtain fast, 
low overhead performance improvement. We plan to integrate such adaptive techniques 
into the application by extending current static and run-time technologies and by de- 
veloping completely new ones. In the following sections we detail some of these opti- 
mization methods and show how they can be incorporated into an integrated adaptive 
system for dynamic, heterogeneous computing. 

3.1 Run-Time Parallelization 

We have developed several techniques [?] that can detect and exploit loop 
level parallelism in various cases encountered in irregular applications: (i) a 
speculative method to detect fully parallel loops (the LRPD test), (ii) an inspector/executor 
technique to compute wavefronts (sequences of mutually independent sets of iterations 
that can be executed in parallel) and (iii) a technique for parallelizing while loops (do 
loops with an unknown number of iterations and/or containing linked list traversals). 
We now briefly describe the utility of some of these techniques; details of their design 
can be found in [?] and other related publications. 

Partially Parallel Loop Parallelization. We have previously developed a run-time 
technique for finding an optimal parallel execution schedule for a partially parallel loop 
[?]. Given the original loop, the compiler generates inspector code that performs 
run-time preprocessing (based on a sorting algorithm) of the loop’s access pattern, and 
scheduler code that schedules (and executes) the loop iterations. The inspector is fully 
parallel, uses no element-wise synchronization, and can implement at run-time array 
privatization and reduction parallelization. Unfortunately this method is not generally 
applicable because a proper, side-effect free inspector cannot be extracted from a loop 
where address and data computation form a dependence cycle. 
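
The inspector/executor idea can be illustrated with a small sketch. It is deliberately simplified: the inspector below is sequential and uses dictionaries, whereas the inspector described above is fully parallel and sort-based, and it omits privatization and reduction handling. The names `wavefronts` and `run_by_wavefront`, and the example loop body, are assumptions for illustration.

```python
def wavefronts(reads, writes):
    """Inspector: group iterations into wavefronts of mutually independent iterations.

    reads[i] and writes[i] list the array elements that iteration i reads and writes.
    """
    last_write = {}  # element -> wavefront of the latest iteration writing it
    last_read = {}   # element -> highest wavefront of any iteration reading it
    fronts = []
    for i, (r, w) in enumerate(zip(reads, writes)):
        level = 0
        for e in r:  # flow dependence: read after an earlier write
            level = max(level, last_write.get(e, -1) + 1)
        for e in w:  # output and anti dependences
            level = max(level, last_write.get(e, -1) + 1, last_read.get(e, -1) + 1)
        for e in r:
            last_read[e] = max(last_read.get(e, -1), level)
        for e in w:
            last_write[e] = level
        while len(fronts) <= level:
            fronts.append([])
        fronts[level].append(i)
    return fronts


def run_by_wavefront(a, src, dst):
    """Executor for the example loop a[dst[i]] = a[src[i]] + 1."""
    fronts = wavefronts([[s] for s in src], [[d] for d in dst])
    for front in fronts:      # wavefronts run one after another
        for i in front:       # iterations inside a wavefront could run in parallel
            a[dst[i]] = a[src[i]] + 1
    return a
```

Note that the inspector only reads the index arrays `src` and `dst`. This is exactly what fails when address and data computation form a dependence cycle, since the indices then cannot be known without executing the loop itself.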

The Recursive LRPD Test. In previous work we have introduced the LRPD test for 
DOALL parallelization, which speculatively executes a loop in parallel and subsequently 
tests whether any data dependences could have occurred [?]. If the test fails, the loop 
is re-executed sequentially. To qualify more parallel loops, array privatization and re- 
duction parallelization can be speculatively applied and their validity tested after loop 
termination. It can be shown that if the LRPD test passes, i.e., the loop is in fact fully 
parallel, then a significant portion of the ideal speedup of the loop is obtained. The draw- 
back of this method is that if the test fails a slowdown equal to the parallel speculative 
execution of the loop may be experienced. 
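
The post-execution analysis can be sketched with shadow sets that record how each array element was accessed. The sketch below follows the general structure of a speculative DOALL test with privatization but is a simplification made for illustration: it works on recorded access traces instead of instrumented code, it does not validate reductions, and the name `lrpd_test` and the trace format are assumptions.

```python
def lrpd_test(traces):
    """Decide whether a loop was fully parallel, given its per-iteration accesses.

    traces[i] is the ordered list of ('r' | 'w', index) accesses of iteration i.
    """
    a_w, a_r, a_np = set(), set(), set()  # shadow markings for the tested array
    tw = 0                                # writes, counted once per iteration and element
    for trace in traces:
        written, read, read_first = set(), set(), set()
        for kind, idx in trace:
            if kind == 'w':
                written.add(idx)
            else:
                read.add(idx)
                if idx not in written:
                    read_first.add(idx)
        a_w |= written
        tw += len(written)
        a_r |= read - written   # read but never written in this iteration
        a_np |= read_first      # read before being written: not privatizable
    tm = len(a_w)               # number of distinct elements written
    if a_w & a_r:
        return "fail"           # an element is written in one iteration and read in another
    if tw == tm:
        return "doall"          # every element is written by at most one iteration
    if a_w & a_np:
        return "fail"           # an element written by several iterations is not privatizable
    return "doall with privatization"
```

A temporary that is written and then read in every iteration passes with privatization, whereas a loop in which one iteration reads what another wrote fails and would be re-executed sequentially.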

We have developed a new technique that can extract the maximum available par- 
allelism from a partially parallel loop and that can be applied to any loop (even if no 
proper inspector can be extracted) and requires less memory overhead. The main idea 
of the Recursive LRPD test [?] is that in any block-scheduled loop executed under the 
processor-wise LRPD test with copy-in, the chunks of iterations that are less than or 
equal to the source of the first detected dependence arc are always executed correctly. 
Only the processors executing iterations larger than or equal to the earliest sink of any 
dependence arc need to re-execute their portion of work. Thus only the remainder of the 
work (of the loop) needs to be re-executed, which can represent a significant saving over 
the previous LRPD test method (which would re-execute the whole loop sequentially). 
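
The staged behaviour of the recursive test can be simulated with a short sketch. It is a model for illustration only: it tracks just cross-processor flow dependences (assuming privatization with copy-in takes care of the others), it counts stages instead of executing real work, and all function names are hypothetical. The `redistribute` flag selects between re-executing on the original processors only and spreading the remaining iterations over all processors.

```python
def split_blocks(lo, hi, p):
    """Split iterations lo..hi-1 into at most p contiguous, nearly equal blocks."""
    n = hi - lo
    if n == 0:
        return []
    p = min(p, n)
    edges = [lo + (n * k) // p for k in range(p + 1)]
    return [(edges[k], edges[k + 1]) for k in range(p)]


def earliest_sink(blocks, reads, writes):
    """Index of the first block that reads data written by a lower-ranked block."""
    written = set()  # elements written by lower-ranked blocks in this stage
    for b, (lo, hi) in enumerate(blocks):
        local = set()  # elements this block has already written itself
        for i in range(lo, hi):
            if any(e in written and e not in local for e in reads[i]):
                return b
            local.update(writes[i])
        written |= local
    return None


def recursive_lrpd(n, p, reads, writes, redistribute=True):
    """Return the number of speculative stages needed to finish n iterations."""
    blocks = split_blocks(0, n, p)
    stages = 0
    while blocks:
        stages += 1
        sink = earliest_sink(blocks, reads, writes)
        if sink is None:
            break                      # everything that remained was correct
        remaining = blocks[sink:]      # blocks before the sink are committed
        if redistribute:
            blocks = split_blocks(remaining[0][0], n, p)
        else:
            blocks = remaining
    return stages
```

For a loop of 8 iterations on 4 processors in which every iteration reads the element written by its predecessor, the sketch needs 4 stages without redistribution and 7 with it, because redistribution turns dependences that were satisfied inside one block into cross-processor ones. A fully independent loop finishes in a single stage in both modes.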

To re-execute the fraction of the iterations assigned to the processors that may have 
worked off erroneous data we need to repair the unsatisfied dependences. This can be 
accomplished by initializing their privatized memory with the data produced by the 
lower ranked processors. Alternatively, we can commit (i.e., copy-out) the correctly 
computed data from private to shared storage and use on-demand copy-in during re- 
execution. We then re-apply recursively the fully parallel LRPD test on the remaining 
iterations until all processors have correctly finished their work. For loops with few 
cross-processor dependences we can expect to finish in only a few parallel steps. We 
have used two different strategies when re-executing a part of the loop: we can 
re-execute only on the processors that have incorrect data and leave the rest of them idle 
(NRD), or we can redistribute at every stage the remainder of the work across all 
processors (RD). There are pros and cons for both approaches. Through redistribution of 
the work we employ all processors all the time and thus the execution time of every 
stage decreases (instead of staying constant, as in the NRD case). The disadvantage is 
that we may uncover new dependences across processors which were satisfied before 
by executing on the same processor. Moreover, there is a ’hidden’ but potentially large 
cost associated with work redistribution: more remote misses during loop execution due 
to data redistribution between the stages of the test. The worst case time complexity for 
no redistribution (NRD) is the cost of a sequential execution: there are at most p steps, 
each performing n/p work, where p is the number of processors and n is the number of 
iterations. In the RD (with redistribution) case each step takes progressively less time because 
the p processors share a decreasing amount of work. The number of steps is heavily 
dependent on the distribution of data dependences of the loop. For example, if we 
assume that at every step we perform correctly 1/2 the work then the total time is less 
than twice the fully parallel execution time of the loop. In practice we have obtained 
better results by adopting a hybrid method which redistributes until the predicted execution 
time of the remaining work is less than the overhead associated with redistribution. In 
other words, we redistribute until the potential benefit of using more processors is 
outweighed by the cost. From that point on we continue without redistribution. A potential 
drawback is that the loop needs to be statically block scheduled in increasing order of 
iteration. The negative impact of this limitation can be reduced through dynamic 
feedback guided scheduling [?]. By applying this new method exclusively we can remove 
the uncertainty or unpredictability of execution time associated with the LRPD test: 
we can guarantee that a speculatively parallelized program will run at least as fast as its 
sequential version with some additional (minor) testing overhead. 
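
The two bounds quoted in this paragraph can be checked with a short calculation. The symbols $T_{par}$ (time of the first, fully parallel stage) and $t$ (time of one iteration, assumed uniform) are introduced here for illustration, and testing and redistribution overheads are ignored. With redistribution, if every stage completes half of the work it attempts, stage $k$ takes $T_{par}/2^k$; without redistribution, each of at most $p$ stages performs $n/p$ iterations per processor.

```latex
T_{\mathrm{RD}} \;=\; \sum_{k=0}^{K} \frac{T_{par}}{2^{k}}
  \;=\; 2\,T_{par}\left(1 - 2^{-(K+1)}\right) \;<\; 2\,T_{par},
\qquad
T_{\mathrm{NRD}} \;\le\; p \cdot \frac{n}{p} \cdot t \;=\; n\,t .
```

The first expression is the geometric series behind the factor of two, and the second shows that the NRD worst case equals the sequential time $n\,t$.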

We have implemented the Recursive LRPD test in both RD and NRD flavors and 
applied it to the three most important loops in TRACK, a Perfect code. The implementa- 
tion is partially done by our run-time pass in Polaris (to automatically apply the simple 
LRPD test) and then additional code has been inserted manually. Our experimental 
testbed is a 16 processor ccUMA HP-V2200 system running HP-UX 11. It has 4 GB of main 
memory and 4 MB single level caches. 

The main loops in TRACK are EXTEND_400, NLFILT_300 and FPTRACK_300. 
They account for approximately 90% of sequential execution time. We have enlarged the input set 
to increase the execution time as well as the size of all associated data structures. The degree of 
parallelism in the loop from NLFILT is very input sensitive and ranges from fully 
parallel to a significant number of cross-processor dependences. All considered loops are 
very load imbalanced and thus, until our feedback guided load balancing is fully 
implemented, exhibit low speedups. Figures 3(a-c) show the speedups for individual loops 
and Figure 3(d) shows the speedup for the entire program. Prior to this technique 
this code was considered sequential. 



Fig. 3. Speedups. Panels: EXTEND_do400 Speedup (with Feedback Guided Block 
Scheduling and Redistribution); NLFILT_do300 Speedup (with All Optimizations); (c); (d). 



3.2 Adaptive Algorithm Selection: Choose the Right Method for Each Case 

Memory accesses in irregular programs take a variety of patterns and are dependent on 
the code itself as well as on their input data. Moreover, some codes are of a dynamic 
nature, i.e., they modify their behavior during their execution. For example, they might 
simulate position dependent interactions between physical entities. 

A special and very frequent case of loop dependence pattern occurs in loops which 
implement reduction operations. In particular, reductions (also known as updates) are 
at the core of a very large number of algorithms and applications, both scientific and 
otherwise, and there is a large body of literature dealing with their parallelization.¹ 
It is difficult to find a reduction parallelization algorithm (or for that matter, other 
optimizations) that will work well in all cases. We have designed an adaptive scheme 
that will detect the type of reference pattern through static (compiler) and dynamic 
(run-time) methods and choose the most appropriate scheme from a library of already 
implemented choices [?]. To find the best choice we establish a taxonomy of different 
access patterns, devise simple, fast ways to recognize them, and model the various old 
and newly developed reduction methods in order to find the best match. The 
characterization of the access pattern is performed at compile time whenever possible, and 
otherwise, at run-time, during an inspector phase or during speculative execution. 

¹ A reduction variable is a variable whose value is used in one associative and commutative 
operation of the form x = x ⊗ exp, where ⊗ is the operator and x does not occur in exp or 
anywhere else in the loop. 



Application - loop            MO   DIM         CHR     CON     SP      Recommended   Experimental result 
                                                                       scheme 
...                           ...  ...         ...     ...     ...     ...           lw > rep > ll > sel 
                                   2,000,000   0.25    1       0.26    sel           sel > lw > ll > rep 
Nbf - DO 50                   1    25,600      25      200     0.25    ll            sel > ll > rep > lw 
                                   128,000     6.25    50      0.25    sel           sel > ll > rep > lw 
                                   256,000     0.625   5       0.25    sel           sel > ll > rep > lw 
                                   1,280,000   0.25    2       0.25    sel           sel > ll > rep > lw 
Moldyn - ComputeForces loop   2    16,384      23.94   95.75   0.41    rep           rep > ll > sel > lw 
                                   42,592      7.75    31      0.36    rep           rep > ll > sel > lw 
                                   70,304      1.69    6.75    0.33    ll            ll > rep > sel > lw 
                                   87,808      0.375   1.5     0.29    ll            ll > rep > sel > lw 
Spark98 - smvpthread() loop   1    30,169      0.625   5       0.18    sel           sel > ll > rep > lw 
                                   7,294       0.6     4.8     0.2     sel           ll > sel > rep > lw 
Charmm - DO 78                2    332,288     35.88   17.9    0.14    sel           ll > sel > rep > lw 
                                               17.94   8.97    0.15    sel           ll > sel > rep > lw 
                                   664,576     1.12    4.48    0.13    sel           ll > sel > rep > lw 
Spice - bjt100                28   186,943     0.14    0.04    0.125   hash          hash > ll > rep 
                                   99,190      0.20    0.06    0.125   hash          hash > ll > rep 
                                   89,925      0.16    0.05    0.125   hash          hash > ll > rep 
                                   33,725      0.16    0.05    0.126   hash          hash > ll > rep 

Fig. 4. The data has been obtained from the execution of the applications on 8 processors. DIM: 
number of reduction elements; SP: sparsity; CON: connectivity; CHR: ratio of total number of 
references to space needed for per processor replicated arrays; MO: mobility. 



From the point of view of optimizing the parallelization of reductions (i.e., selecting 
the best parallel reduction algorithm) we recognize several characteristics of memory 
references to reduction variables. CH is a histogram which shows the number of 
elements referenced by a certain number of iterations. CHR is the ratio of the total 
number of references (or the sum of the CH histogram) and the space needed for allocating 
replicated arrays across processors. CON, the Connectivity of a loop, is the ratio 
between the number of iterations of the loop and the number of distinct memory elements 
referenced by the loop [?]. The Mobility (MO) per iteration of a loop is directly 
proportional to the number of distinct elements that an iteration references. The Sparsity 
(SP) is the ratio of referenced elements to the dimension of the array. The DIM measure 
gives the ratio between the reduction array dimension and cache size. If the program is 
dynamic then changes in the access pattern will be collected, as much as possible, in an 
incremental manner. When the changes are significant enough (a threshold that is tested 
at run-time) then a re-characterization of the reference pattern is needed. 
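
The measures defined in this paragraph can be computed from a recorded access pattern as in the following sketch. Several details are assumptions made for illustration: MO is taken as the average number of distinct elements per iteration, the space for replicated arrays is taken as the number of processors times the array dimension, the cache size is expressed in array elements, and the function name `characterize` is hypothetical.

```python
from collections import Counter


def characterize(accesses, array_dim, num_procs, cache_elems):
    """Compute reference-pattern measures for one reduction loop.

    accesses[i] lists the reduction-array indices updated by iteration i.
    """
    # Number of iterations that reference each element.
    refs = Counter(e for it in accesses for e in set(it))
    total_refs = sum(refs.values())
    distinct = len(refs)
    return {
        "CH": Counter(refs.values()),  # iterations-per-element -> number of elements
        "CHR": total_refs / (num_procs * array_dim),
        "CON": len(accesses) / distinct,
        "MO": sum(len(set(it)) for it in accesses) / len(accesses),
        "SP": distinct / array_dim,
        "DIM": array_dim / cache_elems,
    }
```

For a dynamic program, such a function would be re-run (or updated incrementally) only when the monitored change in the access pattern exceeds the run-time threshold mentioned above.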

Our strategy is to identify the regular components of each irregular pattern (includ- 
ing uniform distribution), isolate and group them together in space and time, and then 
apply the best reduction parallelization method to each component. We have used the 
following novel and previously known parallel reduction algorithms: local write (lw) 
[?] (an ’owner computes’ method), private accumulation and global update in 
replicated private arrays (rep), replicated buffer with links (ll), selective privatization (sel), 
and sparse reductions with privatization in hash tables (hash). Our main goal, once the type 
of pattern is established, is to choose the appropriate reduction parallelization algo- 
rithm, that is, the one which best matches these characteristics. To make this choice we 
use a decision algorithm that takes as input measured, real, code characteristics, and a 
library of available techniques, and selects an algorithm for the given instance. 
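
To make two of the listed schemes concrete, the following sketch contrasts replicated private arrays (rep) with privatization in hash tables (hash) for a sum reduction. The processors are simulated by a sequential loop over contiguous chunks of the update list, and the function names are assumptions for illustration; the sketch does not reproduce the implementations that were measured.

```python
def chunks(seq, p):
    """Split seq into p contiguous, nearly equal parts (one per simulated processor)."""
    n = len(seq)
    return [seq[n * k // p: n * (k + 1) // p] for k in range(p)]


def reduce_sequential(n_elems, updates):
    acc = [0] * n_elems
    for idx, val in updates:
        acc[idx] += val
    return acc


def reduce_rep(n_elems, updates, p):
    """rep: each processor accumulates into a full private copy, then copies are merged."""
    private = [[0] * n_elems for _ in range(p)]
    for rank, part in enumerate(chunks(updates, p)):  # each part would run on its own processor
        for idx, val in part:
            private[rank][idx] += val
    return [sum(column) for column in zip(*private)]  # global update


def reduce_hash(n_elems, updates, p):
    """hash: private copies are hash tables that hold only the touched elements."""
    tables = [{} for _ in range(p)]
    for rank, part in enumerate(chunks(updates, p)):
        for idx, val in part:
            tables[rank][idx] = tables[rank].get(idx, 0) + val
    acc = [0] * n_elems
    for table in tables:
        for idx, val in table.items():
            acc[idx] += val
    return acc
```

The rep scheme allocates and merges p full copies of the array, which is wasteful when references are very sparse; the hash scheme pays a per-access lookup cost but touches only the referenced elements.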

The table shown in Fig. 4 illustrates the experimental validation of our method. All 
memory reference parameters were computed at run-time. The result of the decision 
process is shown in the “Recommended scheme” column. The final column shows the 
actual experimental speedup obtained with the various reduction schemes, which are 
presented in decreasing order of their speedup. For example, for Irreg, the model 
recommended the use of Local Write. The experiments confirm this choice: lw is listed as 
having the best measured speedup of all schemes. 

In the experiment for the SPICE loop, the hash table reduces the allocated and 
processed space to such an extent that, although the setup cost of a hash table is large, the 
performance improves dramatically. It is the only example where hash table reductions 
represent the best solution because of the very sparse nature of the references. We be- 
lieve that codes in C would be of the same nature and thus benefit from hash tables. 
There are no experiments with the Local Write method because iteration replication is 
very difficult due to the modification of shared arrays inside the loop body. 



4 The Toolbox: Modeling, Prediction, and Optimization 

In this section, we describe our current results in developing a performance PREDICTOR 
whose predictions will be used to select among various algorithms, and to help diagnose 
inefficiencies and identify potential optimizations. 

Significant work has been done in low-level analytical models of computer 
architectures and applications [?]. While such analytical models had fallen out of 
favor, being replaced by comprehensive simulations, they have recently been enjoying 
a resurgence due to the need to model large-scale NUMA machines and the availability of 
hardware performance counters [?]. However, these models have mainly been used to 
analyze the performance of various architectures or system-level behaviors. 

In [?] we propose a cost model that we call F, which is based on values 
commonly provided by hardware performance monitors and displays superior accuracy to 
the BSP-like models (our results on the SGI PowerChallenge use the MIPS R10000 
hardware counters [?]). Function F is defined under the assumption that the running 
time is determined by one of the following factors: (1) the accesses issued by some 
processor at the various levels of the hierarchy, (2) the traffic on the interconnect caused by 
accesses to main memory, or (3) bank contention caused by accesses targeting the same 
bank. For each of the above factors, we define a corresponding function (F1, F2, and F3, 
respectively) which should dominate when that behavior is the limiting factor on performance. 
That is, we set F = max{F1, F2, F3}. The functions are linear relations of values 
measurable by hardware performance counters, such as loads/stores issued, L1 and L2 
misses and L1 and L2 write-backs, and the coefficients are determined from 
micro-benchmarking experiments designed to exercise the system in that particular mode. 
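
The structure of F can be sketched as the maximum of three linear functions of counter values. Everything numeric below is a placeholder: the coefficients, the assignment of counters to F1, F2 and F3, and the counter names (in particular `max_bank_accesses`) are assumptions for illustration, since the actual coefficients come from micro-benchmarks that are not reproduced here.

```python
def linear(coeffs, counters):
    """Evaluate a linear relation: sum of coefficient * counter value."""
    return sum(c * counters[name] for name, c in coeffs.items())


# Hypothetical coefficients in seconds per event.
F1 = {"loads_stores": 1e-9, "l1_misses": 1e-8, "l2_misses": 1e-7}  # processor accesses
F2 = {"l2_misses": 2e-7, "l2_writebacks": 2e-7}                    # interconnect traffic
F3 = {"max_bank_accesses": 3e-7}                                   # bank contention


def predict_time(counters):
    """F = max{F1, F2, F3}: the limiting factor determines the running time."""
    return max(linear(F1, counters), linear(F2, counters), linear(F3, counters))
```

With this shape, a change in the access pattern that shifts the bottleneck, for example from processor accesses to bank contention, changes which of the three terms determines the prediction.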

A complete description of F, including detailed validation results, can be found in 
[?]. We present here a synopsis of the results. The function F was compared with three 
BSP-like cost functions based, respectively, on the Queuing Shared Memory (QSM) 
[?] and the (d, x)-BSP [?], which both embody some aspects of memory contention, 
and the Extended BSP (EBSP) model [?], which extends the BSP to account for 
unbalanced communication. Since the BSP-like functions do not account for the memory 
hierarchy, we determined an optimistic (min) version and a pessimistic (max) version 
for each function. The accuracy of the BSP-like functions and F were compared on an 
extensive suite of synthetic access patterns, three bulk-synchronous implementations of 
parallel sorting, and the NAS Parallel Benchmarks [?]. Specifically, we determined 
measured and predicted times (indicated by $T_M$ and $T_P$, respectively) and calculated 
the prediction error as $\mathrm{ERR} = \max\{T_M, T_P\} / \min\{T_M, T_P\}$, which indicates how much smaller or 
larger the predicted time is with respect to the measured time. 

A summary of our findings regarding the accuracy of the BSP-like functions and F 
is shown in Tables 1-3, where we report the maximum value of ERR over all runs (when 
omitted, the average values of ERR are similar). Overall, the F function is clearly 
superior to the BSP-like functions. The validations on synthetic access patterns (Table 1) 
underscore that disregarding hierarchy effects has a significant negative impact on 
accuracy. Moreover, F’s overall high accuracy suggests that phenomena that were 
disregarded when designing it (such as some types of coherency overhead) have only a minor 
impact on performance. Since the sorting algorithms (Table 3) exhibit a high degree of 
locality, we would expect the optimistic versions of the BSP-like functions to perform 
much better than their pessimistic counterparts, and indeed this is the case (errors are 
not shown for EBSPmin and DXBSPmin because they are almost identical to the errors 
for QSMmin). A similar situation occurs for the MPI-based NAS benchmarks (Table 2). 

Performance predictions from a HW counter-based model. One of the advantages 
of the BSP-like functions over the counter-based function F, is that, to a large extent, 
the compiler or programmer can determine the input values for the function. While the 
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Synthetic Access Patterns - ERRs 


Function 


AVG ERR 


MAX ERR 


QSM 


MIN 


24.15 


88.02 


MAX 


53.85 


636.79 


EBSP 


MIN 


24.15 


88.02 


MAX 


27.29 


648.35 


DXBSP 


MIN 


6.36 


31.84 


MAX 


34.8 


411.46 


F 


1.19 


1.91 



Table 1. Synthetic Access Patterns. 



Sorting Programs - MAX ERR 



Sort SStep 


QSM 


EBSP 


DXBSP 


F 


MIN 


MAX 


MAX 


MAX 


Radix 


SSI 


2.12 


400.14 


320.48 


258.55 


1.41 


SS4 


2.72 


321.56 


302.11 


207.78 


1.39 


Sample 


SS2 


2.17 


320.95 


252.25 


207.38 


1.15 


SS4 


2.89 


287.72 


247.31 


185.91 


1.11 


SS5 


2.58 


361.36 


327.08 


233.49 


1.26 


Column 


SSI 


3.44 


268.13 


205.23 


173.25 


1.06 


SS2 


2.46 


268.13 


264.49 


173.25 


2.05 


SS3 


2.88 


230.37 


228.11 


148.85 


1.88 


SS4 


2.61 


245.56 


247.10 


158.67 


2.09 


SS5 


1.36 


484.93 


280.03 


313.34 


1.16 



Table 3. Sorting algorithms for se- 
lected supersteps. 



NAS Parallel Benchmarks - MAX ERR 


NPB 


QSM 


EBSP 


DXBSP 


F 


MIN 


MAX 


MAX 


MAX 


CG 


2.46 


258.31 


210.10 


166.91 


1.46 


EP 


2.42 


262.53 


252.32 


169.64 


1.02 


ET 


2.05 


309.40 


245.64 


199.92 


1.63 


IS 


1.57 


404.81 


354.47 


261.57 


1.39 


LU 


2.15 


295.01 


236.80 


190.62 


1.32 


MG 


1.57 


403.48 


289.11 


260.71 


1.73 


BT 


2.77 


229.68 


189.08 


148.41 


1.05 


SP 


2.13 


298.69 


194.21 


193.00 


1.05 



Table 2. NAS Parallel Benchmarks. 



Sort SStep 


Errors for F 
(Measured) 


Errors for F 
(Estimated) 


AVG 


MAX 


AVG 


MAX 


Radix 


SSI 


1.22 


1.32 


1.01 


1.09 


SS4 


1.11 


1.16 


1.13 


1.16 


Sample 


SS2 


1.05 


1.09 


1.20 


1.21 


SS4 


1.06 


1.11 


1.03 


1.04 


SS5 


1.13 


1.17 


1.24 


1.26 


Column 


SSI 


1.12 


2.49 


1.17 


1.72 


SS2 


1.80 


1.89 


2.02 


2.12 


SS3 


1.66 


1.69 


1.84 


1.87 


SS4 


1.78 


1.83 


2.05 


2.06 


SS5 


1.16 


1.17 


1.88 


1.90 



Table 4. Sorting algorithms: compari- 
son of Ps accuracy with measured vs. es- 
timated counters. 



counter-based function exhibits excellent accuracy, it seems that one should actually 
run the program to obtain the required counts, which would annihilate its potential as a 
performance predictor. However, if one could guess the counter values in advance with 
reasonable accuracy, they could then be plugged into F to obtain accurate predictions. 
For example, in some cases meaningful estimates for the counters might be derived
by extrapolating values for large problem sizes from pilot runs of the program on small 
input sets (which could be performed at run-time by the adaptive system). To investigate 
this issue, we developed least-squares fits for each of the counters used in F for those 
supersteps in our three sorting algorithms that had significant communication. The input 
size n of the sorting instance was used as the independent variable. For each counter, we 
obtained the fits on small input sizes {n/p = 10® • i, for 1 < i < 5), and then used the 
fits to forecast the counter values for large input sizes {n/p = 10® • i, for 5 < i < 10). 
These estimated counter values were then plugged in F to predict the execution times 
for the larger runs. The results of this study are summarized in Table 4. It can be seen
that in all cases, the level of accuracy of F using the extrapolated counter values was
not significantly worse than what was obtained with the actual counter values. These 
preliminary results indicate that at least in some situations a hardware counter-based 
function does indeed have potential as an a priori predictor of performance. Currently, 
we are working on applying this strategy to other architectures, including the HP
V-Class and the SGI Origin 2000.
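
The extrapolation step can be illustrated with a small sketch. It is not the code used in the study: it assumes, purely for illustration, that a counter grows linearly with the per-processor input size, and the pilot measurements below are invented. The names least_squares_line and estimate_counter are hypothetical.

```python
# Illustrative sketch only: fit one hardware counter against input size on
# small pilot runs, then extrapolate to a larger run. A linear model and
# the sample data are assumptions, not values from the paper.

def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Invented pilot measurements: elements per processor and counter value.
pilot_sizes = [1000, 2000, 3000, 4000, 5000]
pilot_counts = [2100, 4050, 6100, 8000, 10150]

slope, intercept = least_squares_line(pilot_sizes, pilot_counts)

def estimate_counter(n_per_proc):
    """Extrapolate the counter value for a larger input size."""
    return slope * n_per_proc + intercept

estimated = estimate_counter(10000)
```

An estimate obtained this way would then take the place of the measured counter value when evaluating F for the larger run.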



5 Hardware 

Smart applications exploit their maximum potential when they execute on reconfig- 
urable hardware. Reconfigurable hardware provides some hooks that enable it to work 
in different modes. In this case, smart applications, once they have determined their true 
behavior statically or dynamically, can actuate these hooks and conform the hardware 
for the application. The result is large performance improvements. 

A promising area for reconfigurable hardware is the hardware cache coherence pro- 
tocol of a CC-NUMA multiprocessor. In this case, we can have a base cache coherence 
protocol that is generally high-performing for all types of access patterns or behaviors 
of the application. However, the system can also support other specialized cache coher- 
ence protocols that are specifically tuned to certain application behaviors. Applications 
should be able to select the type of cache coherence protocol used on a code section 
basis. We provide two examples of specialized cache coherence protocols here. Each of 
these specialized protocols is composed of the base cache coherence transactions plus 
some additional transactions that are suited to certain functions. These two examples 
are the speculative parallelization protocol and the advanced reduction protocol. 

The speculative parallelization protocol is used profitably in sections of a program 
where the dependence structure of the code is not analyzable by the compiler. In this 
case, instead of running the code serially, we run it in parallel on several processors. The 
speculative parallelization protocol contains extensions that, for each protocol transac- 
tion, check if a dependence violation occurred. Specifically, a dependence violation will 
occur if a logically later thread reads a variable before a logically earlier thread writes 
to it. The speculative parallelization protocol can detect such violations because it tags 
every memory location that is read and written, with the ID of the thread that is per- 
forming the access. In addition, it compares the tag ID before the access against the ID 
of the accessing thread. If a dependence violation is detected, an interrupt runs, repairs 
the state, and restarts execution. If such interrupts do not happen too often, the code ex- 
ecutes faster in parallel with the speculative parallelization protocol than serially with 
the base cache coherence protocol. More details can be found in E3I2S1E1- 
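
The violation check described above can be mimicked in software. The sketch below is a toy model, not the hardware protocol: it keeps only a per-location tag holding the highest thread ID that has read the location, and it reports a violation when a logically earlier thread writes afterwards. The class name SpeculativeMemory and its methods are hypothetical.

```python
# Toy software model of the dependence check; the real protocol performs
# these comparisons in the cache coherence hardware.

class SpeculativeMemory:
    def __init__(self):
        self.values = {}
        self.last_reader = {}   # address -> highest thread ID that has read it

    def read(self, thread_id, addr):
        # Tag the location with the ID of the reading thread.
        prev = self.last_reader.get(addr, -1)
        self.last_reader[addr] = max(prev, thread_id)
        return self.values.get(addr, 0)

    def write(self, thread_id, addr, value):
        # Compare the stored tag against the ID of the accessing thread:
        # if a logically later thread already read the old value, the
        # write exposes a dependence violation.
        if self.last_reader.get(addr, -1) > thread_id:
            return False        # caller would repair state and restart
        self.values[addr] = value
        return True
```

A return value of False stands in for the interrupt that repairs the state and restarts execution.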

The advanced reduction protocol is used profitably in sections of a program that 
contain reduction operations. In this case, instead of transforming the code to optimize 
these reduction operations in software, we simply mark the reduction variables and 
run the unmodified code under the new protocol. The protocol has extensions such 
that, when a processor accesses the reduction variable, it makes a privatized copy in its 
cache. Any subsequent accumulation on the variable will not send invalidations to other 
privatized copies in other caches. In addition, when a privatized version is displaced 
from a cache, it is sent to its original memory location and accumulated onto the existing
value. With these extensions, the protocol reduces to a minimum the amount of data
transfer and messaging required to perform a reduction in a CC-NUMA. The result is
that the program runs much faster. More details can be found in E7I .

Fig. 5. Execution time improvements by reconfiguring the cache coherence protocol hardware to
support (a) speculative parallelization, and (b) advanced reduction operations.
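
The behaviour of the advanced reduction protocol can be sketched in software as follows. This is a simplified model under stated assumptions: the reduction is a sum, a privatized copy starts at the neutral element 0, and a dictionary stands in for main memory. The class name ReductionCache is hypothetical.

```python
# Toy model of privatized reduction copies; no invalidations are exchanged
# between caches, and partial results are merged only on displacement.

class ReductionCache:
    def __init__(self, memory):
        self.memory = memory      # shared dict standing in for main memory
        self.private = {}         # address -> locally accumulated partial sum

    def accumulate(self, addr, delta):
        # First touch creates a privatized copy (assumed to start at 0);
        # later accumulations stay local to this cache.
        self.private[addr] = self.private.get(addr, 0) + delta

    def displace(self, addr):
        # On displacement, fold the partial result into the home location.
        self.memory[addr] = self.memory.get(addr, 0) + self.private.pop(addr, 0)
```

In this model the shared location is touched once per cache, at displacement time, rather than once per accumulation.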

We now examine the impact of cache coherence protocol reconfigurability on execution
time. Figure 5 compares the execution time of code sections running on a 16-processor
simulated multiprocessor like the SGI Origin 2000 m- We compare the execution time
under the base cache coherence protocol and under a reconfigurable protocol called
SUPRA. In Figures 5(a) and 5(b), SUPRA is reconfigured to be the speculative
parallelization and advanced reduction protocols, respectively. In both charts, for each
application, the bars are normalized to the execution time under Base.

From the figures, we see that the ability to reconfigure the cache coherence protocol
to conform to the individual application’s characteristics is very beneficial. The code
sections that can benefit from the speculative parallelization protocol (Figure 5(a)) run
on average 75% faster under the new protocol, while those that can benefit from the
advanced reduction protocol (Figure 5(b)) run on average 85% faster under the new
protocol.



6 Conclusions and Future Work 

So far we have made good progress on the development of many of the components of
SmartApps. We will further develop these and combine them into an integrated system. 

The performance of parallel applications is very sensitive to the type and quality 
of operating system services. We therefore propose to further optimize SmartApps by 
interfacing them with an existing customizable OS. While there have been several
proposals of modular, customizable OSs, we plan to use the K42 m experimental OS from
IBM, which represents a commercial-grade development of the TORNADO system 
fra HI. Instead of allowing users to actually alter or rewrite parts of the OS and thus 
raise security issues, the K42 system allows the selective and parametrized use of OS 
modules (objects). Additional modules can be written if necessary but no direct user 
access is allowed to them. This approach will allow our system to configure the type of 
services that will contribute to the full optimization of the program. 

So far we have presented various run-time adaptive techniques that a compiler can 
safely insert into an executable under the form of multiversion code, and that can adapt 
the behavior of the code to the various dynamic conditions of the data as well as that 
of the system on which it is running. Most of these optimization techniques have to 
perform a test at run-time and decide between multi-version sections of code that have 
been pre-optimized by the static compilation process. The multi-version code solution 
may, however, require an impractical number of versions. Applications exhibiting
partial parallelism could be greatly sped up through the use of selective, point-to-point
synchronizations whose placement information is available only at run-time.
Motivated by such ideas, we plan on writing a two-stage compiler. The first stage will
identify which performance components are input dependent, and the second stage will
compile at run time the best solution. We will target what we believe are the most promising
sources of performance improvement for an application executing on a large system:
increase of parallelism, memory latency, and I/O management. In contrast to the run-time
compilers currently under development, which mainly rely on the benefits of
partial evaluation, we are targeting very high level transformations, e.g., parallelization,
removal of (several levels of) indirection, and algorithm selection.
Abstract. Traditional techniques for array analysis do not provide
dataflow information, and thus traditional dataflow-based scalar opti- 
mizations have not been applied to array elements. A number of tech- 
niques have recently been developed for producing information about 
array dataflow, raising the possibility that dataflow-based optimizations 
could be applied to array elements. In this paper, we show that the value- 
based dependence relations produced by the Omega Test can be used as 
a basis for generalizations of several scalar optimizations. 



Keywords: array dataflow, constant folding, constant propagation, copy prop- 
agation, dead code elimination, forward substitution, unreachable code elimina- 
tion, value-based array dependences 



1 Introduction 

Traditional scalar analysis techniques produce information about the flow of 
values, whereas traditional array data dependence analysis gives information 
about memory aliasing. Thus, many analysis and optimization techniques that 
require dataflow information have been restricted to scalar variables, or applied 
to entire arrays (rather than individual array elements). A number of techniques
have recently been developed for accurately describing the flow of values in array 
variables QElOlEEliniBEl- These are more expensive than scalar dataflow 
analysis, and the developers (except for Sarkar and Knobe) have focused on opti- 
mizations with very high payoff, such as automatic parallelization. However, once 
the analysis has been performed, we can apply the result to other optimizations 
as well. 

In this paper, we give algorithms for dead-code elimination, constant prop- 
agation and constant folding, detection of constant conditions and elimination 
of unreachable code, copy propagation, and forward substitution for arrays. Our 
algorithms have two important properties: First, they work with elements of 
arrays and executions of statements, rather than treating each variable or state- 
ment as a monolithic entity. Second, they are not based on descriptions of finite 
sets of values, and thus (unlike the techniques of |S|) can detect the constant
array in Figure 1(b) as well as that in Figure 1(a).

S.P. Midkiff et al. (Eds.): LCPC 2000, LNCS 2017, pp. 37- rm 2001. 

(c) Springer-Verlag Berlin Heidelberg 2001
      REAL*8 A(0:3)

      A(0) = -8.0D0/3.0D0
      A(1) = 0.0D0
      A(2) = 1.0D0/6.0D0
      A(3) = 1.0D0/12.0D0

(a) A in 107.MGRID |S]

      DO 10 J = 1, JMAX
        JPLUS(J) = J+1
 10   CONTINUE

      IF ( .NOT. PERIDC ) THEN
        JPLUS(JMAX) = JMAX
      ELSE
        JPLUS(JMAX) = 1
      END IF

(b) JPLUS in initia in ARC2D |2|



Fig. 1. Definitions of Constant Arrays 



2 Representing Programs with Integer Tuple Sets and 
Relations 

Information about scalar dataflow is generally defined at the statement level, 
without distinguishing different executions of a statement in a loop. In contrast, 
information about array dataflow, and optimizations based on this information, 
should be based on executions of each statement, as the set of elements used 
may vary with the loop iteration. We use the Omega Library m to represent 
and manipulate this information. 

The remainder of this section reviews our representations of iteration spaces, 
data dependences, and mappings from iterations to values, as well as the op- 
erations that we apply to these representations. Readers who are familiar with 
earlier work of the Omega Project may wish to skip to the new material, which 
begins with Section 3.

2.1 Representing Sets of Iterations 

We identify each execution of a given statement with a tuple of integers giving
the indices of the surrounding loops. For example, consider the definition of
JPLUS in the ARC2D benchmark of the Perfect Club benchmark set 0 . We will refer
to the three defining statements as S (the definition in the loop), T (the definition in
the then branch of the if), and U (the definition in the else branch). Statement
S is executed repeatedly; we label the first execution [1], the second [2], etc.,
up to [JMAX]. We can represent the set $\mathit{iterations}(S)$, the set of iterations of S,
using constraints on a variable J (representing the loop index):

$$\mathit{iterations}(S) = \{\, [J] \mid 1 \le J \le \mathit{JMAX} \,\}.$$

These constraints can refer to values not known at compile time (such as
JMAX), allowing us to represent a set without having an upper bound on its size.
The Omega Library lets us use arbitrary affine constraints (i.e., the expressions
being compared can be represented by sums of constants and variables multiplied
by constant factors), though the library is not efficient over this entire domain.
However, our previous experimental work indicates that, for the constraints that 
arise in practice, it can perform most of the operations discussed in this paper 
efficiently (see Section and 0 for details). 

We can use constraints to describe a set of loop iterations (as above) or the
set of values of symbolic constants for which a statement executes. For example,

$$\mathit{iterations}(T) = \{\, [\,] \mid \lnot \mathit{PERIDC} \,\}.$$

Executions of statements inside no loops correspond to zero-element tuples (as
above); executions in nested loops correspond to multi-element tuples. The set
of executions of the first assignment to JP1 in Figure 2 (Statement V) is

$$\mathit{iterations}(V) = \{\, [N, J] \mid 2 \le N \le 4 \land \mathit{JLOW} \le J \le \mathit{JUP} \,\}.$$
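
To make these sets concrete, the sketch below enumerates them for fixed values of the symbolic constants. This is only an illustration: the Omega Library manipulates such sets symbolically, without choosing values for JMAX, JLOW, JUP, or PERIDC. The function names are hypothetical.

```python
# Enumerate iteration sets as Python sets of integer tuples, for concrete
# (assumed) values of the symbolic constants.

def iterations_S(jmax):
    # { [J] | 1 <= J <= JMAX }
    return {(j,) for j in range(1, jmax + 1)}

def iterations_T(peridc):
    # { [] | not PERIDC }: a single zero-element tuple, or nothing at all
    return set() if peridc else {()}

def iterations_V(jlow, jup):
    # { [N, J] | 2 <= N <= 4 and JLOW <= J <= JUP }
    return {(n, j) for n in range(2, 5) for j in range(jlow, jup + 1)}
```

For instance, iterations_V(1, 3) contains nine two-element tuples, one per pair of N and J values.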



      DO 300 N = 2,4
        DO 210 J = JLOW,JUP
          JP1 = JPLUS(J)
          JP2 = JPLUS(JP1)
          DO 212 K = KLOW,KUP
C
            WORK(J,K,4) = -(C2 + 3.*C4 + C4M)*XYJ(JP1,K)
            WORK(J,K,5) = XYJ(JP2,K)*C4
 212      CONTINUE
 210    CONTINUE
C
        IF(.NOT.PERIDC)THEN
          J = JLOW
          JP1 = JPLUS(J)
          JP2 = JPLUS(JP1)
          DO 220 K = KLOW,KUP
C
            WORK(J,K,4) = -(C2 + 3.*C4 + C4M)*XYJ(JP1,K)
            WORK(J,K,5) = XYJ(JP2,K)+C4
 220      CONTINUE
C
C       (uses of WORK array follow)

Fig. 2. Uses of JPLUS in 300.stepfx in ARC2D (Simplified)



2.2 Representing Mappings 

We can use relations between integer tuples to describe mappings, such as a 
mapping from loop iterations to values. For example, the fact that iteration J
of Statement S produces the value J + 1 can be described with the relation

$$\mathit{value}(S) = \{\, [J] \to [J+1] \mid 1 \le J \le \mathit{JMAX} \,\}.$$

2.3 Representing Dependences as Relations

Array data dependence information, whether based on memory aliasing or value 
flow, can also be described with integer tuple relations. The domain of the tuple 
represents executions of the source of the dependence, and the range executions 
of the sink. For example, if Figure 1(b) were followed immediately by Figure 2,
then the value-based dependence from the definition of JPLUS in Statement S to
the use in Statement V, $\delta_{S \to V}$, would be:

$$\delta_{S \to V} = \{\, [J] \to [N, J] \mid 1, \mathit{JLOW} \le J \le \mathit{JUP}, \mathit{JMAX}-1 \;\land\; 2 \le N \le 4 \,\}$$

Here the comma-separated bounds mean that J is at least both 1 and JLOW, and at
most both JUP and JMAX - 1. Note that the constraint $J \le \mathit{JMAX} - 1$ indicates
that there is no dependence from iteration [JMAX] of the definition, since that
definition is killed by the conditional assignment.
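
As a sanity check on this relation, one can simulate the code for concrete values and record, for each read, which statement execution last wrote the element. The sketch below does this by brute force; it is not how the Omega Test computes dependences, and it assumes 1 <= JLOW and JUP <= JMAX so that every element read has been written. The function names are hypothetical.

```python
# Brute-force value-based dependences for Figure 1(b) followed by the read
# of JPLUS(J) in Statement V, for concrete (assumed) parameter values.

def value_based_deps(jmax, jlow, jup, peridc):
    last_writer = {}                      # array index -> (statement, iteration)
    for j in range(1, jmax + 1):          # Statement S
        last_writer[j] = ("S", (j,))
    if not peridc:
        last_writer[jmax] = ("T", ())     # Statement T kills S's last write
    else:
        last_writer[jmax] = ("U", ())     # Statement U kills S's last write
    deps = set()
    for n in range(2, 5):                 # Statement V reads JPLUS(J)
        for j in range(jlow, jup + 1):
            deps.add((last_writer[j], (n, j)))
    return deps

def delta_S_to_V(jmax, jlow, jup):
    # The relation from the text, with the comma-separated bounds expanded.
    lo, hi = max(1, jlow), min(jup, jmax - 1)
    return {(("S", (j,)), (n, j)) for n in range(2, 5) for j in range(lo, hi + 1)}
```

For the parameter values tried, the dependences whose source is S coincide with the relation above, and iteration [JMAX] of S never appears as a source.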

When all constraints are affine, the analysis we describe in gnu can produce 
exact value-based dependence relations. This occurs for any procedure in which 
control flow consists of for loops and ifs, and all loop bounds, conditions, 
and subscripts are affine functions of the surrounding loop indices and a set of 
symbolic constants. In other cases, our analysis may produce an approximate 
result. 

2.4 Operations on Sets and Relations 

Our algorithms use relational operations such as union, intersection, subtraction,
join, and finding or restricting the range or domain of a relation. (The relational
join operation, $\bullet$, is simply a different notation for composition: $A \bullet B = B \circ A$.) In
some cases, we manipulate sets describing iteration spaces, or relations giving the
value-based dependences between two statements. Our experience with dataflow
analysis indicates that we can generally create these sets and relations, and
perform the above operations on them, efficiently [7].

In other cases, we discuss the manipulation of transitive dependences. The
transitive value-based flow dependence from statement A to C describes all the
ways in which information from an execution of A can reach an execution of
C. For example, there are transitive dependences from the definition of JPLUS
in Figure 1 to the definition of WORK in Figure 2, since JPLUS is used in the
definition of JP1, which is used in the definition of WORK.

When there are no cycles of dependences¹, we can compute transitive dependences
using composition. For example, if dependences run only from A to
B and from B to C, we can compute the transitive dependence from A to C as
$\delta_{A \to B} \bullet \delta_{B \to C}$.

¹ Note that each dependence must point forward in time, so there can be no cycle of
dependences among iterations. Thus, we use the term “cycle of dependences” to refer
to a cycle of dependences among statements: an iteration of statement B depends
on some iteration of A, and an iteration of A on some iteration of B.

If there is a dependence cycle, computing transitive dependences requires the
transitive closure operation, defined as the infinite union

$$R^{*} = I \cup R \cup R \bullet R \cup R \bullet R \bullet R \cup \cdots$$

where I is the identity relation. For example, there is a dependence from the
second statement of Figure 3 to itself, and the transitive closure of this
self-dependence shows that every iteration is dependent on every earlier iteration.
Transitive closure differs from the other operations we perform in that it cannot 
be computed exactly for arbitrary affine relations, and in that we do not have 
significant experience with its efficiency in practice. Furthermore, finding all 
transitive dependences for a group of N statements may take 0{N^) transitive 
closure operations El. Thus, we provide alternative (less precise) algorithms in 
cases where our general algorithm would require transitive closure operations. 
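
For a finite relation, join and transitive closure are easy to state directly, which may help fix the definitions. The sketch below works on explicit sets of pairs for a concrete JMAX; the Omega Library instead operates on symbolic relations, for which closure cannot always be computed exactly. The helper names are hypothetical.

```python
# Join and transitive closure on finite relations (sets of pairs).

def join(a, b):
    """Relational join: pairs (x, z) such that x A y and y B z for some y."""
    return {(x, z) for (x, y1) in a for (y2, z) in b if y1 == y2}

def transitive_closure(r, domain):
    """R* = I u R u R.R u ... for a finite relation r over the given domain."""
    closure = {(x, x) for x in domain}    # identity relation I
    power = set(closure)
    while True:
        power = join(power, r)            # next power of R
        if power <= closure:              # nothing new: the union has converged
            return closure
        closure |= power

# Self-dependence of the loop statement in Figure 3 for JMAX = 6:
# iteration J reads the value written by iteration J-1, for J = 3..6.
self_dep = {(j, j + 1) for j in range(2, 6)}
closure = transitive_closure(self_dep, range(2, 7))
```

The resulting closure relates every iteration to itself and to every later iteration, matching the observation that each iteration depends on every earlier one.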



      JPLUS(1) = 2
      DO 10 J = 2, JMAX
        JPLUS(J) = JPLUS(J-1)+1
 10   CONTINUE
      IF ( .NOT. PERIDC ) THEN
        JPLUS(JMAX) = JMAX
      ELSE
        JPLUS(JMAX) = 1
      ENDIF

Fig. 3. Alternative Definition of JPLUS, with Dependence Cycle 



2.5 Generating Code from Sets of Iterations 

Our optimizations make changes to a statement’s set of iterations. For example, 
since the value defined in iteration JMAX of Statement S is never used (it must 
be killed by the conditional definition that follows it), we can eliminate iteration 
JMAX. That is, we can change the iteration space so that it ends at JMAX - 1
rather than JMAX:

$$\mathit{iterations}(S) := \{\, [J] \mid 1 \le J \le \mathit{JMAX} - 1 \,\}.$$

In principle, we could remove loop iterations by adding if statements to the 
affected loop bodies, but this would introduce excessive overhead. Instead, we 
use the code generation facilities of the Omega Library to produce efficient code 
for such loop nests D2|. 

3 Dead Code Elimination 

We generalize the technique of dead code elimination US! by searching not for 
statements that produce a value that is never used, but for iterations that do 
so. We define as live any execution of a statement that is “obviously live” (i.e.,
produces a result of the procedure or may have a side effect) or may have dataflow
to some live statement execution.

$$\mathit{iterations}(X) := \mathit{obviously\_live}(X) \;\cup \bigcup_{\delta_{X \to Y}} \mathit{domain}(\delta_{X \to Y}) \qquad (1)$$



For example, we can remove the last iteration of Statement S, as we saw in
Section 2.5.

Our union operation is currently somewhat less effective at detecting redundancy
than our subtraction operation. In practice, our code generation algorithms
may work better if we reformulate Equation 1 to subtract, from the
non-obviously-live iterations, any iterations that are not in the domain of any
dataflow relation.

As we remove iterations, we must update the dataflow information involving
that statement, possibly causing other iterations of other statements to become
dead. However, we cannot simply apply Equation 1 and adjust the dataflow
information until we reach a fixed point. Our analysis eliminates executions of
statements, not entire statements, so (in the absence of a bound on the number
of loop iterations) an iterative application may not terminate.

When there is no cycle of dependences, we simply topologically sort the set 
of statements (ordering them such that all iterations of each statement depend 
only on prior statements), and remove dead iterations in a single traversal of 
this list (working from the results back to the start of the procedure). The above 
algorithm is equivalent to first finding all transitive value-based flow dependences 
to obviously live iterations, and then removing iterations that are not in the 
domain of these dependences. However, we expect that most code will not have 
many dead iterations, so finding all transitive dependences to perform dead code 
elimination would usually be a waste of time. 

For code that does contain a dependence cycle (such as Figure 3), we produce
a safe approximation of the sets of dead iterations by handling each dependence 
cycle as follows: If there are any dataflow edges from the cycle to any live it- 
eration, or any obviously live iterations in the cycle, we mark all iterations of 
the cycle as live; otherwise, they must be dead. In principle, we could also use 
transitive dependence calculations within each cyclic component, but we fear the 
overhead of doing so would be prohibitive. 
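
The acyclic case can be illustrated on Statement S with explicit sets. The sketch below applies Equation 1 for a concrete JMAX and an invented consumer W that reads every element of JPLUS and is assumed to be obviously live; the dataflow relation is written out by hand rather than computed. All names are hypothetical.

```python
# Equation (1) on finite sets: an iteration is live if it is obviously
# live or has dataflow to some live execution of a later statement.

def live_iterations(obviously_live, outgoing_deps, live_sinks):
    live = set(obviously_live)
    for dep in outgoing_deps:
        # Keep only sources whose sink execution is itself live.
        live |= {src for (src, dst) in dep if dst in live_sinks}
    return live

JMAX = 5
iterations_S = {(j,) for j in range(1, JMAX + 1)}

# Assumed consumer W: iteration J reads JPLUS(J), for J = 1..JMAX.
live_W = {(j,) for j in range(1, JMAX + 1)}

# S reaches W only for J < JMAX; element JMAX is redefined by T or U.
delta_S_to_W = {((j,), (j,)) for j in range(1, JMAX)}

live_S = live_iterations(set(), [delta_S_to_W], live_W)
dead_S = iterations_S - live_S
```

Only iteration [JMAX] of S ends up in dead_S, which corresponds to shrinking the loop bound to JMAX - 1.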

4 Constant Propagation 

We generalize the notion of constant by defining as constant those statement 
executions for which the value produced is a function only of the surrounding 
loop indices and the set of symbolic constants. This includes traditional scalar 
constant expressions (which are constant functions of the loop indices), and all 
executions of all assignments in Figure 1, but not statements that read input.

We represent these constants with mappings from the symbolic constants and
indices of the surrounding loops to the value computed. When these mappings are
affine, and the values produced are integral, these mappings can be manipulated
by the Omega Library. For example, as we saw in Section 2.2,

$$\mathit{value}(S) = \{\, [J] \to [J+1] \mid 1 \le J \le \mathit{JMAX} \,\}.$$

Note that these relations may include conditions under which the values are
defined, as in

$$\mathit{value}(T) = \{\, [\,] \to [\mathit{JMAX}] \mid \lnot P \,\}.$$

(We have abbreviated PERIDC as P to make the presentation that follows
more concise.)

We can extend this approach to handle a finite set of values of other types
by keeping a table of constant values and storing a mapping from loop indices
to indices into this table. For example, we indicate that the first definition of
Figure 1(a) produces the value -8.0/3.0 by storing this real number constant at the
next available index in our constant table (for example, 0), and generating a
relation mapping to this index (such as $\{\, [\,] \to [0] \,\}$). We also flag such relations
to ensure that we never treat the table index itself as the value of the expression.



4.1 Intra-procedural Constant Propagation 

We initially identify all iterations of some statements as constant via a syntactic 
analysis: This step identifies as constant all iterations of any statement with a 
constant expression or an expression made up of constants and the loop indices. 
Note that we can safely treat as non-constant any constant function that is outside
the scope of our expression evaluator. This step identifies all the assignments
in both parts of Figure 1 as constant, and produces value relations such as the
ones shown above. Note that, for code with no constants, our analysis stops at
this step.

Next, we can identify as constant any executions of reads with incoming
dataflow arcs from constant statement executions, such as all uses of JPLUS in
the stepfx routine (shown in Figure 2). If Figure 2 immediately followed Figure
1(b), the read of JPLUS(J) in Statement V would receive dataflow from all three
definitions of JPLUS:

$$\begin{aligned}
\delta_{S \to V} &= \{\, [J] \to [N, J] \mid 1, \mathit{JLOW} \le J \le \mathit{JUP}, \mathit{JMAX}-1 \;\land\; 2 \le N \le 4 \,\} \\
\delta_{T \to V} &= \{\, [\,] \to [N, \mathit{JMAX}] \mid \mathit{JLOW} \le \mathit{JMAX} \le \mathit{JUP} \land 2 \le N \le 4 \land \lnot P \,\} \\
\delta_{U \to V} &= \{\, [\,] \to [N, \mathit{JMAX}] \mid \mathit{JLOW} \le \mathit{JMAX} \le \mathit{JUP} \land 2 \le N \le 4 \land P \,\}
\end{aligned}$$

For a use of a variable in a statement X, we generate a mapping from the iteration
in which a read occurs to the constant value read, which we call $\mathit{value\_in}(X)$.
This relation is the union, over all incoming dataflow, of the join of the inverse of each
dataflow relation with the value relation for each constant source statement:

$$\mathit{value\_in}(X) = \bigcup_{\delta_{Y \to X}} (\delta_{Y \to X})^{-1} \bullet \mathit{value}(Y) \qquad (2)$$
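In general, $\mathit{value\_in}(X)$ may be defined over a subset of $\mathit{iterations}(X)$, or be
restricted by constraints on the symbolic constants. In our running example, the
constant value read in V depends on PERIDC:

$$\begin{aligned}
\mathit{value\_in}(V) = {} & \{\, [N, J] \to [J+1] \mid 1, \mathit{JLOW} \le J \le \mathit{JUP}, \mathit{JMAX}-1 \;\land\; 2 \le N \le 4 \,\} \\
\cup {} & \{\, [N, \mathit{JMAX}] \to [\mathit{JMAX}] \mid \mathit{JLOW} \le \mathit{JMAX} \le \mathit{JUP} \land 2 \le N \le 4 \land \lnot P \,\} \\
\cup {} & \{\, [N, \mathit{JMAX}] \to [1] \mid \mathit{JLOW} \le \mathit{JMAX} \le \mathit{JUP} \land 2 \le N \le 4 \land P \,\}
\end{aligned}$$

Since Statement V simply assigns this value to JP1, $\mathit{value}(V) = \mathit{value\_in}(V)$.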
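
Equation 2 can be traced on finite relations. The sketch below builds the three dataflow relations and value relations for concrete values of the symbolic constants, then joins each inverted dataflow relation with the corresponding value relation. It is an enumeration for illustration only, not the symbolic computation performed by the Omega Library; value_in_V and the helpers are hypothetical names.

```python
# Equation (2) on explicit relations: value_in(V) is the union of
# inverse(delta) joined with value(source), over all incoming dataflow.

def inverse(rel):
    return {(b, a) for (a, b) in rel}

def join(a, b):
    return {(x, z) for (x, y1) in a for (y2, z) in b if y1 == y2}

def value_in_V(jmax, jlow, jup, peridc):
    ns = range(2, 5)
    value_S = {((j,), j + 1) for j in range(1, jmax + 1)}
    value_T = set() if peridc else {((), jmax)}
    value_U = {((), 1)} if peridc else set()

    lo, hi = max(1, jlow), min(jup, jmax - 1)
    delta_SV = {((j,), (n, j)) for n in ns for j in range(lo, hi + 1)}
    reaches_last = jlow <= jmax <= jup
    delta_TV = {((), (n, jmax)) for n in ns} if reaches_last and not peridc else set()
    delta_UV = {((), (n, jmax)) for n in ns} if reaches_last and peridc else set()

    pairs = (join(inverse(delta_SV), value_S)
             | join(inverse(delta_TV), value_T)
             | join(inverse(delta_UV), value_U))
    return dict(pairs)        # maps iteration [N, J] of V to the constant read
```

With JMAX = 6, JLOW = 2, JUP = 6, the result maps [N, J] to J + 1 for J below JMAX, and [N, JMAX] to JMAX or 1 depending on PERIDC.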

      DO 10 J=1,JMAX-1
        JPLUS(J) = J+1
 10   CONTINUE
      JPLUS(JMAX)=1
      DO 20 J=2,JMAX-1
        B(J) = 5*JPLUS(J+1) - 2*JPLUS(J-1) + J
 20   CONTINUE

Fig. 4. Hypothetical Constant Definition of B 

In general, a constant expression may combine several reads, constants, and
loop index variables. In this case, we must produce a separate $\mathit{value\_in}(X_i)$
for each read i in statement X, and then combine these to produce $\mathit{value}(X)$
according to the expression computed by the statement. Consider what would
happen in the second loop of Figure 4 (Statement R): The read JPLUS(J+1)
(which we’ll call $R_1$) has dataflow from the two assignments (Statements P and
Q):

$$\begin{aligned}
\delta_{P \to R_1} &= \{\, [J] \to [J-1] \mid 3 \le J \le \mathit{JMAX}-1 \,\} \\
\delta_{Q \to R_1} &= \{\, [\,] \to [\mathit{JMAX}-1] \mid 3 \le \mathit{JMAX} \,\}
\end{aligned}$$

The read JPLUS(J-1) ($R_2$) has dataflow only from the definition in the loop:

$$\delta_{P \to R_2} = \{\, [J] \to [J+1] \mid 1 \le J \le \mathit{JMAX}-2 \,\}$$

The value relations are thus

$$\begin{aligned}
\mathit{value\_in}(R_1) = {} & \{\, [J] \to [J+2] \mid 2 \le J \le \mathit{JMAX}-2 \,\} \\
\cup {} & \{\, [\mathit{JMAX}-1] \to [1] \mid 3 \le \mathit{JMAX} \,\} \\
\mathit{value\_in}(R_2) = {} & \{\, [J] \to [J] \mid 2 \le J \le \mathit{JMAX}-1 \,\}
\end{aligned}$$

We then combine the value relations for each read $X_i$ to produce
$\mathit{value\_in}(X)$. This combined relation is defined in terms of the value relations
$\mathit{value\_in}(X_1), \mathit{value\_in}(X_2), \ldots$ as

$$\mathit{value\_in}(X) = \{\, [I_1, I_2, \ldots] \to [v_1, v_2, \ldots] \mid [I_1, I_2, \ldots] \to [v_i] \in \mathit{value\_in}(X_i) \,\} \qquad (3)$$
In our example, $\mathit{value\_in}(R)$ maps from J to the values read in the expression,
i.e. JPLUS(J+1), JPLUS(J-1), and J, as follows:

$$\begin{aligned}
\mathit{value\_in}(R) = {} & \{\, [J] \to [J+2, J, J] \mid 2 \le J \le \mathit{JMAX}-2 \,\} \\
\cup {} & \{\, [\mathit{JMAX}-1] \to [1, \mathit{JMAX}-1, \mathit{JMAX}-1] \mid 3 \le \mathit{JMAX} \,\}.
\end{aligned}$$

We then build, from the expression in statement X, a relation $\mathit{expression}(X)$
that maps from the values read to the value produced by the statement, e.g.

$$\mathit{expression}(R) = \{\, [R_1, R_2, R_3] \to [5R_1 - 2R_2 + R_3] \,\}$$

We can then define the mapping from iterations to constant values by applying
the expression to the incoming values:

$$\mathit{value}(X) = \mathit{value\_in}(X) \bullet \mathit{expression}(X) \qquad (4)$$

Note that this join of the incoming constant information $\mathit{value\_in}(X)$ with
$\mathit{expression}(X)$ is a form of symbolic constant folding. In our example,

$$\begin{aligned}
\mathit{value}(R) = {} & \{\, [J] \to [4J+10] \mid 2 \le J \le \mathit{JMAX}-2 \,\} \\
\cup {} & \{\, [\mathit{JMAX}-1] \to [-\mathit{JMAX}+6] \mid 3 \le \mathit{JMAX} \,\}.
\end{aligned}$$
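
The folded relation can be checked against a direct simulation of Figure 4. The sketch below is only a brute-force check for concrete values of JMAX, written for this illustration; simulate_b and value_R are hypothetical names.

```python
# Compare a direct execution of Figure 4 with the folded relation value(R).

def simulate_b(jmax):
    """Execute Figure 4 and return the values stored into B."""
    jplus = {}
    for j in range(1, jmax):              # Statement P: J = 1..JMAX-1
        jplus[j] = j + 1
    jplus[jmax] = 1                       # Statement Q
    # Statement R: J = 2..JMAX-1
    return {j: 5 * jplus[j + 1] - 2 * jplus[j - 1] + j for j in range(2, jmax)}

def value_R(jmax):
    """The folded constants: 4J+10 in general, -JMAX+6 for the last iteration."""
    folded = {j: 4 * j + 10 for j in range(2, jmax - 1)}   # 2 <= J <= JMAX-2
    if jmax >= 3:
        folded[jmax - 1] = -jmax + 6
    return folded
```

For each JMAX tried (3 and above), the two functions return the same mapping, for example B(2) = 18 when JMAX = 8.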

Note that we can safely use the degenerate case, $\mathit{value}(X) = \{\, [J] \to [V] \mid \mathit{false} \,\}$
(i.e. the empty set of constant iterations), for any case in which we are
unable to detect or represent a constant (e.g. when an $\mathit{expression}(X)$ cannot be
represented in our system, or when any $\mathit{value\_in}(X_i)$ is non-constant).

When there are no dependence cycles, we can apply Equations 2, 3, and 4 in
a single pass through the topologically sorted statement graph that we used in
Section 3. As was the case in dead-code elimination, dependence cycles (such as
the one in Figure 3) can keep an iterative analysis from terminating. We could,
in principle, produce approximate results via the transitive closure operation. In
practice, we simply do not propagate constants within dependence cycles.

4.2 Interprocedural Constant Propagation 

In the actual ARC2D benchmark, the code for Figures 1(b) and 2 is in separate
procedures. However, we may not have interprocedural value-based flow
dependence relations.

We therefore perform interprocedural constant propagation on a
procedure-by-procedure basis. We start with a pass of inter-procedural data flow analysis
in which no attempt is made to distinguish among array elements (and thus
standard algorithms for scalars HB| may be employed). We generate a graph in
which each procedure is a node, and there is an arc from P1 to P2 if P2 may read
a value defined in P1. We break this graph into strongly connected components,
and topologically sort the components. We process these components in order,
starting with those that receive no data from any other component. Within each
component, we process procedures in an arbitrary order.
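
The ordering step can be sketched with a standard strongly-connected-components pass. The code below uses Tarjan's algorithm, which is our choice for this illustration and is not named in the text; the procedure names in the example graph are invented.

```python
# Order procedures for analysis: strongly connected components of the
# "may read a value defined in" graph, producers before consumers.

def strongly_connected_components(graph):
    """Tarjan's algorithm; components come out in reverse topological order."""
    index, low, on_stack, stack, result = {}, {}, set(), [], []

    def visit(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:            # v is the root of a component
            component = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.add(w)
                if w == v:
                    break
            result.append(component)

    for v in graph:
        if v not in index:
            visit(v)
    return result

def processing_order(graph):
    """Components that receive no data from other components come first."""
    return list(reversed(strongly_connected_components(graph)))

# Hypothetical call graph: an arc from P1 to P2 means P2 may read data defined in P1.
example = {"init": ["a", "b"], "a": ["b"], "b": ["a", "report"], "report": []}
order = processing_order(example)
```

Here the initialization procedure is analyzed first, the mutually dependent pair forms one component, and the pure consumer comes last.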
Note that the above ordering may have little effect other than to identify
one “initialization” procedure, and place it before all others in our analysis. We 
believe this will be sufficient for our purposes, however, as we expect that many 
constants will be defined in such routines. 

Our restriction to a single analysis of each procedure could cause us to miss
some constants that could, in principle, be identified. For example, assume
Procedure P1 uses A and defines a constant array B, and Procedure P2 uses B and
defines the constant array A. If one of these arrays cannot depend (even
transitively) on the other, we could extract definition information from each procedure
and then use it to optimize the other. However, without a study of how such cases
arise in practice, it is not clear whether it is best to (a) develop a general
algorithm to detect them (this would require interprocedural transitive dependence
information, though not on an element-by-element basis), (b) detect simple cases
(for example, if either A or B can be syntactically identified as a constant), or
(c) ignore them.

For each procedure, we describe any constant values stored in each array
with a mapping from subscripts to values stored in the array. These mappings
can be generated by composing our iteration-space based information with
relations from array subscripts to iterations (these are defined by the subscript
expressions). For example, our intra-procedural analysis of Figure 1(b) produces
the following mapping for JPLUS:

$$\begin{aligned}
& \{\, [J] \to [J+1] \mid 1 \le J \le \mathit{JMAX}-1 \,\} \\
\cup {} & \{\, [\mathit{JMAX}] \to [\mathit{JMAX}] \mid \lnot P \,\} \\
\cup {} & \{\, [\mathit{JMAX}] \to [1] \mid P \,\}.
\end{aligned}$$

We then use each mapping in our analysis of each procedure that uses the 
corresponding array. When cycles in our graph of procedures prevent us from 
having information about all arrays used in a procedure, we can safely assume 
all elements of an array are non-constant. 

For each use of a constant array, we compose our mapping from subscripts 
to values with the relation from loop iterations to array subscripts, producing a 
relation from iterations to values read. This is exactly the information we derived 
in the second step of Section and we continue the analysis from that point. 
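
The composition described above can be illustrated on a small, concrete instance. The following Python sketch is only an illustration and not the constraint-based implementation used in the paper: it enumerates the relations explicitly for one fixed JMAX instead of manipulating symbolic affine constraints, and it assumes that P is a flag (called `periodic` here) that selects between the two definitions of JPLUS(JMAX). The function names and the sample read of JPLUS(J) in iteration J are hypothetical.

```python
def jplus_values(jmax, periodic):
    """Subscript -> value mapping for JPLUS, enumerated for one JMAX.

    Assumption: the flag `periodic` plays the role of the condition P.
    """
    values = {j: j + 1 for j in range(1, jmax)}   # [J] -> [J+1], 1 <= J <= JMAX-1
    values[jmax] = 1 if periodic else jmax        # the two guarded cases for [JMAX]
    return values


def compose(iteration_to_subscript, subscript_to_value):
    """Relation from loop iterations to the values they read.

    Iterations whose subscript has no known constant value are left out,
    which corresponds to treating those reads as non-constant.
    """
    return {
        iteration: subscript_to_value[subscript]
        for iteration, subscript in iteration_to_subscript.items()
        if subscript in subscript_to_value
    }


if __name__ == "__main__":
    jmax = 5
    # Hypothetical use: iteration J of some loop reads JPLUS(J), J = 2..JMAX.
    reads = {j: j for j in range(2, jmax + 1)}
    print(compose(reads, jplus_values(jmax, periodic=False)))
    print(compose(reads, jplus_values(jmax, periodic=True)))
```

A symbolic implementation would keep the guards (such as the range of J and the condition P) in the relation instead of enumerating elements, so that the result stays valid for every JMAX.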

4.3 Using Propagated Constants 

In some cases, it may be profitable to insert the constants detected above into 
the expressions that use them; in other cases, the value of detecting the constants 
may lie in their ability to improve other forms of analysis. 

For example, a computationally intensive loop in the resid subroutine mul- 
tiplies each of four complicated expressions by a value defined in FigureD(a)- 
Substituting 0.0 for A(1) lets us eliminate one of these expressions, reducing 
inner loop computation by about 20% 0. 

Our constant propagation can improve our analysis of the usage of the XYJ 
array in Figure 0. Without constant propagation, the presence of JP1 in 
the subscript expression prevents us from describing the pattern of uses with 
affine terms. This forces our constraint solver to use algorithms that are more 
expensive and may produce approximate results, which may then interfere with 
other steps such as cache optimization. We can eliminate the non-affine term, 
and thus produce an exact description of the usage of XYJ, by substituting our 
constant value relation for JP1 during the analysis of the XYJ array (we do 
not actually replace JP1 in the program). Note that stepfx is one of the most 
computationally intensive routines in the ARC2D benchmark, taking about one 
third of the total run time m, so optimization of these loops is critical for good 
performance. 

Our ability to use information about the values of a conditional expression 
to eliminate some or all executions of the controlled statement corresponds to 
traditional unreachable code elimination. 

5 Copy Propagation and General Forward Substitution 

For the constant functions described in the previous section, it is always legal to 
insert a (possibly conditional) expression that selects the appropriate constant, 
even though it may not always be beneficial. Extending our system for more 
general forward substitution requires a system for determining whether the ex- 
pression being substituted has the same value at the point of substitution, and 
a way of estimating the profitability of forward substitution. The latter requires 
a comparison of costs of evaluating the expression at the point of use vs. com- 
puting and storing the value and later fetching it from memory. The cost of 
the latter is hard to estimate, as subsequent locality optimization can greatly 
affect it. We have therefore not made a detailed exploration of general forward 
substitution. 

      (old[i-1]+2*old[i]+old[i+1]);        (old[i-1]+2*cur[i]+cur[i+1]); 

}                                    } 

Fig. 5. Three Point Stencil before and after Partial Copy Propagation 



We have, however, made use of copy propagation as an integrated part of 
cache optimization [15, 16]. Copy propagation may simplify the task of cache 
optimization by eliminating a “copy” loop from an imperfect loop nest such as 
the stencil shown at the left of Figure 5. Such cases, in which the first of two 
loop nests is a simple array copy, can be detected syntactically. We then find 
the mapping from the subscript expressions of the destination array (old) to 
those of the source array (cur), assuming these expressions are affine (if not, we 
simply do not perform copy propagation). We can then apply this mapping to 
the subscript expressions in each use of the destination array, to find equivalent 
subscripts for the original array in the new context. For example, we replace 
old[i+1] with cur[i+1] in Figure 5 by applying the subscript mapping of the 
copy statement (simply $[j] \to [j] \mid 0 \le j \le N-1$) to the subscript expression 
$(i+1)$. 

Two conditions must be met for this substitution to be legal. First, the 
domain of the subscript mapping for the copy must contain the subscripts at the 
point of use (in our example, $\{[j] \mid 0 \le j \le N-1\}$ contains $\{[i+1] \mid 1 \le i \le N-2\}$). 

However, if this does not hold, we could perform copy propagation for only the 
iterations where it is true. For example, if the original program set old[0] and 
old[N-1] before the t loop and ran j from 1 to N-2, we could perform this copy 
propagation for all but the last iteration of the i loop. We could then propagate 
the value of old[N-1] separately for this last iteration. 

Secondly, the substitution must not affect the array dataflow (as this could 
change the result of the calculation). We can test this by comparing the original 
dataflow to the dataflow that would occur from the right hand side of the copy 
to the potentially propagated array use. In our example, if we replace old[i+1] 
with cur[i+1], and compute the flow of values that would occur if the copy loop 
had written cur[j] rather than read it, we get the original dataflow relation 
for the use of old[i+1]. The same is true for the replacement of old[i] with 
cur[i], but this test fails for old[i-1], since cur[i-1] would read the value 
from the previous iteration of the i loop, rather than the j loop nest. Thus, this 
form of copy propagation only lets us produce the code on the right of Figure 5. 



for (int t = 0; t<T; t++) { 
  for (int i = 1; i<=N-2; i++) 
    cur[t%2][i] = 0.25 * 
      (cur[(t-1)%2][i-1]+2*cur[(t-1)%2][i]+cur[(t-1)%2][i+1]); 
} 

if (t%2 == 0) 
  ... use cur[0][*] 
else 
  ... use cur[1][*] 



Fig. 6. Three Point Stencil after Partial Expansion and Full Copy Propagation 



To produce the full copy propagation shown in Figure 6, we must find a way 
to retain the original dataflow after the propagation. Since the corruption of the 
dataflow is purely the result of overwriting the value we need, there must be 
some memory transformation (such as renaming or full array expansion) that 
will produce this effect. Once all uses of the copy array have been eliminated, 
the copy loop itself can be removed, allowing the use of techniques that require 
perfect loop nests (such as traditional skewing and tiling). Our full technique for 
producing a new memory mapping that is not prohibitively wasteful of memory 
(as full array expansion would be) is beyond the scope of this paper. However, 
even the simple expansion by a factor of two shown in Figure 6, followed by 
skewing and tiling, can more than double execution speed M- 
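
The legality argument above can be checked experimentally on small inputs. The Python sketch below is an illustration under stated assumptions, not the paper's analysis: the "before" version is written from the description in the text (a copy loop over j followed by the three-point stencil over i), all function names are hypothetical, and the test simply compares results instead of comparing dataflow relations. It shows that the partial propagation and the two-plane expansion reproduce the original values, while also replacing old[i-1] with cur[i-1] (without expansion) does not.

```python
def stencil_with_copy(init, steps):
    """Original form: copy cur into old, then update cur from old."""
    cur = list(init)
    n = len(cur)
    old = [0.0] * n
    for _ in range(steps):
        for j in range(n):                       # the "copy" loop
            old[j] = cur[j]
        for i in range(1, n - 1):
            cur[i] = 0.25 * (old[i - 1] + 2 * old[i] + old[i + 1])
    return cur


def stencil_partial(init, steps):
    """Partial copy propagation: old[i] and old[i+1] replaced by cur."""
    cur = list(init)
    n = len(cur)
    old = [0.0] * n
    for _ in range(steps):
        for j in range(n):
            old[j] = cur[j]
        for i in range(1, n - 1):
            cur[i] = 0.25 * (old[i - 1] + 2 * cur[i] + cur[i + 1])
    return cur


def stencil_illegal(init, steps):
    """All three uses replaced without expansion: reads an overwritten value."""
    cur = list(init)
    n = len(cur)
    for _ in range(steps):
        for i in range(1, n - 1):
            cur[i] = 0.25 * (cur[i - 1] + 2 * cur[i] + cur[i + 1])
    return cur


def stencil_expanded(init, steps):
    """Expansion by a factor of two, then full copy propagation."""
    n = len(init)
    cur = [list(init), list(init)]               # both planes hold the boundary values
    for t in range(steps):
        src, dst = cur[(t - 1) % 2], cur[t % 2]
        for i in range(1, n - 1):
            dst[i] = 0.25 * (src[i - 1] + 2 * src[i] + src[i + 1])
    return cur[(steps - 1) % 2]                  # plane written by the last time step


if __name__ == "__main__":
    data = [1.0, 0.0, 0.0, 4.0, 0.0, 2.0]
    print(stencil_with_copy(data, 3) == stencil_partial(data, 3))
    print(stencil_with_copy(data, 3) == stencil_expanded(data, 3))
    print(stencil_with_copy(data, 3) == stencil_illegal(data, 3))
```

Note that Python's `%` always returns a non-negative result, so `(t - 1) % 2` is 1 when t is 0; a direct C transcription would need `(t + 1) % 2` or an equivalent to avoid a negative index.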

This copy propagation algorithm can also be used to extend our constant 
propagation system to handle cases in which the constant value is not an affine 
function of the loop indices (in this case, we do not have to worry about inter- 
vening writes to the (constant) array). 

Full forward substitution of more complex array expressions is a straight- 
forward extension of our copy propagation algorithm. It simply requires the 
calculation of new subscript expressions, and the elimination of any intervening 
writes, for each array used in the expression. 



6 Related Work 

Our algorithms are more general than traditional scalar optimizations in that 
they handle individual executions of statements and elements of arrays. However, 
they are less general in that they are “pessimistic” rather than “optimistic” . As 
we have not found any cases in which optimistic analysis is needed for arrays, 
we suspect such a technique would be of purely academic interest. 

There are a number of systems for performing analysis of the flow of values 
in array variables [1, 2, 3, 4, 5, 6, 7]. In principle, the results of any of these 
systems could be used as the basis for extensions of traditional scalar dataflow-based 
optimizations. However, only the work of Sarkar and Knobe [8] addresses 
this possibility. Their array SSA form lets them represent a finite set of known 
constant values, and thus is unable to handle our example Figure Q)b) (in which 
the number of constant values is not known at compile time). 

Kelly, Pugh, Rosser, and Shpeisman developed the transitive closure algorithm 
for the Omega Library [11]. They give a number of applications of this 
operation, including the calculation of transitive data dependences. They also 
show that transitive closure can be used to perform scalar induction variable 
detection or even “completely describe the effect of arbitrary programs” PH 
Sections. 4]. However, this is used to show “that transitive closure cannot always 
be computed exactly” , and does not seem to be a serious proposal for induction 
variable detection. It is interesting, though not necessarily useful, to view the 
pessimistic and optimistic formulations of traditional constant propagation as 
increasingly accurate approximations of this transitive closure calculation (re- 
stricted to cases in which all iterations of each statement must be identified as 
either constant or potentially non-constant). 

Many of our algorithms are simply efficient ways to approximate a result 
that could be achieved in a straightforward fashion from the set of all transitive 
dependences. However, Pugh et al. do not discuss the application of transitive 
closure for these optimizations, and it would almost certainly be prohibitively 
expensive to perform them by computing all transitive dependences HU. 

7 Efficiency 

While we do not have extensive experience applying these optimizations, our ex- 
perience performing dependence analysis with the Omega Library indicates that 
the operations that we need to perform should run efficiently on the relations 
that we generate, as long as we can avoid transitive closure. For example, we 
used the “time” operation of the Omega Calculator (version 1.10) to measure 
the amount of time required for a Linux workstation with a 400 MHz Celeron 
processor to produce the value relation for the definition of JP1, as discussed 
at the start of Section EH This combination of three inverse operations, three 
join operations, two union operations, and final simplification and redundancy 
checking took under a millisecond. (The exact time depends on how much simpli- 
fication and redundancy checking is performed; it varied from 0.2ms to 0.6ms.) 
It may be the case that the transitive closure calculations would also run quickly, 
but our experience with this operation is much more limited. Sarkar and Knobe 
[8] do not give timing results for their analysis, so we are unable to compare our 
execution time to theirs. 

8 Conclusions 

The availability of information about the flow of values in array variables en- 
ables a variety of analysis and optimization techniques that had previously been 
applied only to scalars. These techniques are most effective when they are up- 
dated to work with individual executions of statements rather than treat each 
statement as an atomic object. 

We have presented algorithms for dead-code elimination, constant propaga- 
tion (which can be used for constant folding, detection of constant conditions, 
and unreachable code elimination), and forward substitution (including copy 
propagation), that are based on such definitions. Our algorithms are exact for 
code that is free of dependence cycles and has affine control flow and subscript ex- 
pressions. For such code, our dead code elimination algorithm will eliminate any 
iterations that produce values that are not used, and our constant propagation 
algorithm will identify as constant any statement executions that produce values 
that we can represent as constant functions of the surrounding loop indices and 
symbolic constants. For code with dependence cycles, we provide approximate 
results, to avoid computing transitive dependences. 

This work is supported by NSF grant CCR-9808694. 

References 

[1] Paul Feautrier. Dataflow analysis of scalar and array references. International 
Journal of Parallel Programming, 20(1):23-53, February 1991. 

[2] Dror E. Maydan, Saman P. Amarasinghe, and Monica S. Lam. Data dependence 
and data-flow analysis of arrays. In 5th Workshop on Languages and Compilers for 
Parallel Computing (Yale University tech. report YALEU/DCS/RR-915), pages 
283-292, August 1992. 

[3] Peng Tu and David Padua. Array privatization for shared and distributed memory 
machines. In Workshop on Languages, Compilers, and Run-Time Environments 
for Distributed Memory Multiprocessors, September 1992. 

[4] William Pugh and David Wonnacott. An exact method for analysis of value-based 
array data dependences. In Proceedings of the 6th Annual Workshop on 
Programming Languages and Compilers for Parallel Computing, volume 768 of 
Lecture Notes in Computer Science. Springer-Verlag, Berlin, August 1993. Also 
CS-TR-3196, Dept. of Computer Science, University of Maryland, College Park. 

[5] Junjie Gu, Zhiyuan Li, and Gyungho Lee. Symbolic array dataflow analysis for ar- 
ray privatization and program parallelization. In Supercomputing ’95, San Diego, 
Ca, December 1995. 

[6] Denis Barthou, Jean-François Collard, and Paul Feautrier. Fuzzy array dataflow 
analysis. Journal of Parallel and Distributed Computing, 40:210-226, 1997. 

[7] William Pugh and David Wonnacott. Constraint-based array dependence analysis. 
ACM Trans. on Programming Languages and Systems, 20(3):635-678, May 
1998. http://www.acm.org/pubs/citations/journals/toplas/1998-20-3/p635-pugh/. 

[8] Vivek Sarkar and Kathleen Knobe. Enabling sparse constant propagation of array 
element via array SSA form. In Static Analysis Symposium (SAS’98), 1998. 

[9] M. Berry et al. The PERFECT Club benchmarks: Effective performance evalu- 
ation of supercomputers. International Journal of Supercomputing Applications, 
3(3):5-40, March 1989. 

[10] Wayne Kelly, Vadim Maslov, William Pugh, Evan Rosser, Tatiana Shpeisman, 
and David Wonnacott. The Omega Library interface guide. Technical 
Report CS-TR-3445, Dept. of Computer Science, University of Maryland, 
College Park, March 1995. The Omega library is available from 
http://www.cs.umd.edu/projects/omega. 

[11] Wayne Kelly, William Pugh, Evan Rosser, and Tatiana Shpeisman. Transitive 
closure of infinite graphs and its applications. International J. of Parallel Pro- 
gramming, 24(6):579-598, December 1996. 

[12] Wayne Kelly, William Pugh, and Evan Rosser. Code generation for multiple map- 
pings. In The 5th Symposium on the Frontiers of Massively Parallel Computation, 
pages 332-341, McLean, Virginia, February 1995. 

[13] Steven S. Muchnick. Advanced Compiler Design and Implementation. Morgan 
Kaufmann Publishers, Inc., 1997. 

[14] R. Eigenmann, J. Hoeflinger, Z. Li, and D. Padua. Experience in the automatic 
parallelization of 4 Perfect benchmark programs. In Proceedings of the 4th Workshop 
on Programming Languages and Compilers for Parallel Computing, August 
1991. Also Technical Report 1193, CSRD, Univ. of Illinois. 

[15] David Wonnacott. Achieving scalable locality with time skewing. In preparation. 
A preprint is available as http://www.haverford.edu/cmsc/davew/cache-opt/tskew.ps, 
and parts of this work are included in Rutgers University 
CS Tech Reports 379 and 378, 2000. 

[16] David Wonnacott. Using time skewing to eliminate idle time due to memory 
bandwidth and network limitations. In Proceedings of the 2000 International 
Parallel and Distributed Processing Symposium, May 2000. 

[17] Evan Rosser. Personal communication, March 2000. 




Searching for the Best FFT Formulas with the 
SPL Compiler 

Jeremy Johnson^1, Robert W. Johnson^2, David A. Padua^3, and Jianxin Xiong^3 

^1 Drexel University, Philadelphia, PA 19104, 
jjohnson@mcs.drexel.edu 
^2 MathStar Inc., Minneapolis, MN 55402, 
rwj@mathstar.com 
^3 University of Illinois at Urbana-Champaign, Urbana, IL 61801, 
{padua, jxiong}@cs.uiuc.edu 



Abstract. This paper discusses an approach to implementing and optimizing 
fast signal transforms based on a domain-specific computer language, 
guage, called SPL. SPL programs, which are essentially mathematical 
formulas, represent matrix factorizations, which provide fast algorithms 
for computing many important signal transforms. A special purpose com- 
piler translates SPL programs into efficient FORTRAN programs. Since 
there are many formulas for a given transform, a fast implementation 
can be obtained by generating alternative formulas and searching for the 
one with the fastest execution time. This paper presents an application 
of this methodology to the implementation of the FFT. 



1 Introduction 

This paper discusses an approach to implementing and optimizing fast signal 
transforms, such as the fast Fourier transform (FFT), which can be expressed 
as matrix factorizations. The approach relies on a domain-specific computer lan- 
guage, called SPL, for expressing matrix factorizations, and a special purpose 
compiler which translates matrix factorizations expressed in SPL into efficient 
FORTRAN programs. The factorization of a matrix into a product of sparse 
structured matrices leads to a fast algorithm to multiply the matrix with an 
arbitrary input vector, and the SPL compiler translates such a factorization 
into an efficient program to perform the matrix-vector product. Since SPL programs 
are mathematical formulas involving matrices, it is possible to transform 
SPL programs into equivalent programs through the use of theorems from linear 
algebra. Moreover, given a matrix, it is possible, through the application of var- 
ious mathematical transformations, to automatically generate a large number of 
mathematically equivalent SPL programs applying the given matrix to a vector. 
This allows us, through the use of the SPL compiler, to systematically search for 
the SPL program with the fastest execution time. When the matrix represents 
a signal transform, such as the FFT, this procedure allows us to find a fast im- 
plementation of the signal transform. The only difficulty with this approach is 
that the size of the search space grows exponentially and therefore more refined 
search techniques than exhaustive search must be utilized. This paper presents 
an application of this methodology to the implementation and optimization of 
the FFT. Various techniques are presented that allow us to find nearly opti- 
mal SPL programs while only searching through a small subspace of the allowed 
programs. 

The approach presented in this paper was first outlined in and further 
discussed in |2j. Our work is part of a larger effort, called SPIRAL (I^) for 
automatically implementing and optimizing signal processing algorithms. This 
paper presents the first efficient implementation of the SPL compiler and hence 
the first true test of the methodology. 

Our approach is similar to that used by FFTW El 0, however, FFTW is 
specific to the FFT, while our use of SPL allows us to implement and optimize a 
far more general set of programs. FFTW also utilizes a special purpose compiler 
and searches for efficient implementations. However, their compiler is based on 
built-in code sequences for a fixed set of FFT algorithms and is only used for 
generating small modules of straight-line code, called codelets. Larger FFTs are 
performed using a recursive program which utilizes codelets of different sizes in 
the base case. A search over a subset of recursive FFT algorithms is performed, 
using dynamic programming, to find the recursive decomposition and set of 
codelets with the best performance. Our approach, even when restricted to the 
FFT, considers a much larger search space and uses search techniques other than 
dynamic programming. Similar to the work in FFTW, is the package described 
in 0 for computing the Walsh-Hadamard transform (WHT). This work is closer 
in spirit to the work in this paper, in that the different algorithms considered are 
expressed mathematically and the search is carried out over a space of formulas. 
Similar to FFTW, the WHT package is restricted to a specific transform, and 
a code generator restricted to the WHT is used rather than a compiler. Finally, 
in addition to searching over a space of mathematical formulas, this paper also 
considers a search over a space of compiler options and techniques. In particu- 
lar, we allow compilation of SPL programs to perform various amounts of loop 
unrolling, and include in our search different amounts of unrolling. 

Related to our approach is the ATLAS project, which uses empirical search 
techniques to automatically tune high performance implementations of the Basic 
Linear Algebra Subroutines (BLAS). The application of search techniques in 
determining appropriate compiler optimizations has been used in several efforts. 
Kisuki and Knijnenburg |2j presented similar ideas using the term “iterative 
compilation”. They experimented with loop tiling, loop unrolling and array padding 
with different parameters, as well as different search algorithms. Their results 
appeared to be good compared with static selection algorithms. One difference 
between their work and ours is that our search is carried out over both different input 
programs and different compilation parameters, and thus provides more flexibility. 
Massalin EDI presented a “superoptimizer” which tries to find the shortest machine 
code sequence to compute a function. They carried out an exhaustive 
search within the space of a subset of the machine instruction set to ensure 
that the final result is optimal. Since the size of their search space increases 
exponentially, the method only works for very short programs. 

In the remainder of the paper we review the SPL language, the SPL compiler, 
SPL formula generation, and apply search techniques to systematically search 
for the optimal FFT implementation. 

2 Matrix Factorizations and SPL 

Many digital signal processing (DSP) transforms are mathematically given as a 
multiplication of a matrix M (the transform) with a vector x (the sampled sig- 
nal). Examples are the discrete Fourier transform (DFT), trigonometric trans- 
forms, and the Hartley and Haar transforms, ng. A fast algorithm for these 
transforms can be given by a factorization of M into a product of sparse ma- 
trices, [QIH. For example, the well-known Cooley/Tukey algorithm, 0, also 
known as the fast Fourier transform (FFT), for computing an $rs$-point DFT, $F_{rs}$, 
can be written as 

\[ F_{rs} = (F_r \otimes I_s)\, T\, (I_r \otimes F_s)\, L \qquad (1) \]

where $\otimes$ denotes the tensor product, $I$ is the identity matrix, $T$ a diagonal 
matrix, and $L$ a permutation matrix, both depending on $r$ and $s$. 

SPL 0 is a domain-specific programming language for expressing and im- 
plementing matrix factorizations. It was originally developed to investigate and 
automate the implementation of FFT algorithms p. As such, some of the built- 
in matrices are biased towards the expression of FFT algorithms, however, it is 
important to note that SPL is not restricted to the FFT. It contains features that 
allow the user to introduce notation suitable to any class of matrix expressions. 

This section reviews some notation for expressing matrix factorizations and 
summarizes the SPL language. 

2.1 Matrix Factorizations 

In this section we show how a matrix factorization of the n-point DFT matrix, 
$F_n$, leads to a program to compute the matrix-vector product of $F_n$ applied to 
a vector. 

The n-point Fourier transform is the linear computation 

\[ y(l) = \sum_{k=0}^{n-1} \omega_n^{lk}\, x(k), \qquad 0 \le l < n, \qquad (2) \]

where $\omega_n = e^{2\pi i/n}$. Computation of the Fourier transform can be represented 
by the matrix-vector product $y = F_n x$, where $F_n$ is the $n \times n$ matrix $[\omega_n^{lk}]_{0 \le l,k < n}$. 
Computing the n-point Fourier transform as a matrix-vector product requires 
$O(n^2)$ operations. A faster algorithm can be obtained by factoring $F_n$ into a 
product of simpler matrices. 

Theorem 1. Let $N = 2m$. Then 

\[ F_N L^N_m = \begin{bmatrix} F_m & W_m F_m \\ F_m & -W_m F_m \end{bmatrix} \qquad (3) \]

where $W_m = \mathrm{diag}(1, \omega_N, \ldots, \omega_N^{m-1})$ and $L^N_m$ is the permutation that rearranges 
the elements of a vector into m segments containing elements of the input separated 
by a stride of m and with the i-th segment starting with the i-th element 
of the input. 



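This block decomposition can be written compactly using the tensor product 
(also called Kronecker product). Let $A$ be an $m \times m$ matrix and $B$ an $n \times n$ matrix; 
then the tensor product of $A$ and $B$, $A \otimes B$, is the $mn \times mn$ matrix defined by 
the block matrix product 

\[ A \otimes B = [a_{ij} B]_{1 \le i,j \le m} =
\begin{bmatrix} a_{1,1}B & \cdots & a_{1,m}B \\ \vdots & \ddots & \vdots \\ a_{m,1}B & \cdots & a_{m,m}B \end{bmatrix} \qquad (4) \]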



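Corollary 1. 

\[ F_{2m} = (F_2 \otimes I_m)\, T^{2m}_m\, (I_2 \otimes F_m)\, L^{2m}_2, \qquad (5) \]

where $T^{2m}_m = \mathrm{diag}(I_m, W_m)$. 

More generally, 

Theorem 2 (Cooley-Tukey). 

\[ F_{rs} = (F_r \otimes I_s)\, T^{rs}_s\, (I_r \otimes F_s)\, L^{rs}_r \qquad (6) \]

where $L^{rs}_r$ is a permutation matrix and $T^{rs}_s$ is a diagonal matrix. Let $e^n_i$, $0 \le i < n$, 
be a vector of size n with a one in the i-th location and zeroes elsewhere, 
and let $\omega_{rs}$ be a primitive rs-th root of unity. Then 
$T^{rs}_s (e^r_i \otimes e^s_j) = \omega_{rs}^{ij} (e^r_i \otimes e^s_j)$ 
and $L^{rs}_r (e^s_i \otimes e^r_j) = (e^r_j \otimes e^s_i)$. 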

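For example, the 4-point DFT can be factored as follows: 

\[
F_4 =
\begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & i & -1 & -i \\ 1 & -1 & 1 & -1 \\ 1 & -i & -1 & i \end{bmatrix}
=
\begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & i \end{bmatrix}
\begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & -1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
\]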
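
The Cooley-Tukey factorization can be checked numerically by building each factor as an explicit matrix. The Python sketch below is only a verification aid written for this purpose, not part of SPL: the function names are hypothetical, matrices are plain nested lists, and it assumes the conventions $\omega_n = e^{2\pi i/n}$, $T^{rs}_s(e^r_i \otimes e^s_j) = \omega_{rs}^{ij}(e^r_i \otimes e^s_j)$ and $L^{rs}_r(e^s_i \otimes e^r_j) = e^r_j \otimes e^s_i$.

```python
import cmath


def dft(n):
    """n x n DFT matrix [w^(l*k)], with w = exp(2*pi*i/n) (assumed convention)."""
    w = cmath.exp(2j * cmath.pi / n)
    return [[w ** (l * k) for k in range(n)] for l in range(n)]


def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def kron(a, b):
    """Tensor (Kronecker) product of two matrices."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def twiddle(r, s):
    """T^{rs}_s: diagonal matrix with w_rs^(i*j) at position i*s + j."""
    n = r * s
    w = cmath.exp(2j * cmath.pi / n)
    t = [[0] * n for _ in range(n)]
    for i in range(r):
        for j in range(s):
            t[i * s + j][i * s + j] = w ** (i * j)
    return t


def stride(r, s):
    """L^{rs}_r: permutation sending e^s_i (x) e^r_j to e^r_j (x) e^s_i."""
    n = r * s
    p = [[0] * n for _ in range(n)]
    for i in range(s):
        for j in range(r):
            p[j * s + i][i * r + j] = 1
    return p


def cooley_tukey(r, s):
    """(F_r (x) I_s) T^{rs}_s (I_r (x) F_s) L^{rs}_r as an explicit matrix."""
    f = kron(dft(r), identity(s))
    for factor in (twiddle(r, s), kron(identity(r), dft(s)), stride(r, s)):
        f = matmul(f, factor)
    return f


def max_error(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


if __name__ == "__main__":
    for r, s in [(2, 2), (2, 4), (4, 2), (3, 5)]:
        print(r, s, max_error(cooley_tukey(r, s), dft(r * s)))
```

With r = s = 2, `twiddle` and `stride` produce the diagonal and permutation factors of the 4-point example. Building dense matrices costs far more than the fast algorithm itself, so this is useful only for checking formulas on small sizes.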



An algorithm to compute $y = F_N x$, where $N = 2m$, can be obtained using 
the previous factorization. The output $y$ is obtained using the factorization $F_N = 
(F_2 \otimes I_m)\, T^N_m\, (I_2 \otimes F_m)\, L^N_2$ by first applying $L^N_2$ to the input $x$, then applying 
$(I_2 \otimes F_m)$ to the resulting vector, and so on. 

2.2 SPL 

An SPL program is essentially a sequence of mathematical formulas built up 
from a parameterized set of special matrices and algebraic operators such as 
matrix composition, direct sum, and the tensor product. The SPL language uses a 
prefix notation similar to Lisp to represent formulas. For example, the expressions 
(compose A B) and (tensor A B) correspond to the matrix product $A \cdot B$ and 
the tensor product $A \otimes B$ respectively. The language also includes special symbols 
such as (F n) and (I m) to represent the discrete Fourier transform matrix $F_n$ 
and the identity matrix $I_m$. In addition to the built-in symbols, the user can 
assign an SPL formula to a symbol to be used in other expressions. Moreover, 
new parameterized symbols can be defined so that SPL programs can refer to 
other sets of parameterized matrices. 

For example, the Cooley-Tukey factorization of $F_{rs}$ for any integer values r 
and s is represented by the SPL expression 

(compose (tensor (F r) (I s)) (T rs s) (tensor (I r) (F s)) (L rs r)) 

Note that all SPL programs represent linear computations of a fixed size. Ex- 
pressions with parameters such as the one above are used to instantiate families 
of SPL programs. 

The power of SPL for expressing alternative algorithms is illustrated by the 
following list of formulas corresponding to different variants of the FFT, inmni 
FR] . Each of the formulas was obtained using the Cooley-Tukey factorization 
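and elementary properties of the tensor product. In the following formulas, the 
symbol $R_{2^k}$ denotes the $k$-bit bit-reversal permutation. 

Apply Cooley/Tukey inductively 

\[ F_8 = (F_2 \otimes I_4)\, T^8_4\, (I_2 \otimes F_4)\, L^8_2 \qquad (7) \]

Recursive FFT 

\[ F_8 = (F_2 \otimes I_4)\, T^8_4\, (I_2 \otimes ((F_2 \otimes I_2)\, T^4_2\, (I_2 \otimes F_2)\, L^4_2))\, L^8_2 \qquad (8) \]

Iterative FFT (Cooley/Tukey) 

\[ F_8 = (F_2 \otimes I_4)\, T^8_4\, (I_2 \otimes F_2 \otimes I_2)(I_2 \otimes T^4_2)(I_4 \otimes F_2)\, R_8 \qquad (9) \]

Vector FFT (Stockham) 

\[ F_8 = (F_2 \otimes I_4)\, T^8_4\, L^8_2\, (F_2 \otimes I_4)(T^4_2 \otimes I_2)(L^4_2 \otimes I_2)(F_2 \otimes I_4) \qquad (10) \]

Vector FFT (Korn/Lambiotte) 

\[ F_8 = (F_2 \otimes I_4)\, T^8_4\, L^8_2\, (F_2 \otimes I_4)(T^4_2 \otimes I_2)\, L^8_2\, (F_2 \otimes I_4)\, L^8_2\, R_8 \qquad (11) \]

Parallel FFT (Pease) 

\[ F_8 = L^8_2\, (I_4 \otimes F_2)\, L^8_4\, T^8_4\, L^8_2\, L^8_2\, (I_4 \otimes F_2)\, L^8_4\, (T^4_2 \otimes I_2)\, L^8_2\, L^8_2\, (I_4 \otimes F_2)\, R_8 \qquad (12) \]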

These formulas are easily translated into SPL programs. The following SPL program 
corresponds to the formula for the iterative FFT on 8 points. 

(define R8 (permutation (0 4 2 6 1 5 3 7))) 

(compose (tensor (F 2) (I 4)) (T 8 4) 
         (tensor (I 2) (F 2) (I 2)) (tensor (I 2) (T 4 2)) 
         (tensor (I 4) (F 2)) R8) 
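
The index list used to define R8 is the 3-bit bit-reversal permutation. As a small illustration, the following Python sketch (the function name `bit_reversal` is hypothetical and not part of SPL) computes the k-bit bit-reversal of the indices 0 to 2^k - 1, using zero-based indices as in the listing above.

```python
def bit_reversal(k):
    """Return the list p where p[i] is i with its k-bit binary representation reversed."""
    perm = []
    for index in range(1 << k):
        reversed_index = 0
        for bit in range(k):
            if index & (1 << bit):
                # Bit `bit` of the input becomes bit k-1-bit of the output.
                reversed_index |= 1 << (k - 1 - bit)
        perm.append(reversed_index)
    return perm


if __name__ == "__main__":
    print(bit_reversal(3))
```

Bit reversal is its own inverse, so the result is the same whether the list is read as "output i takes input p[i]" or as "input i goes to output p[i]".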

In general, an SPL program consists of the following constructs. This list refers 
to SPL 3.01 (see 0 for future enhancements and the most current version). 

(1) matrix operations: 
    (tensor formula formula ...) 
    (compose formula formula ...) 
    (direct_sum formula formula ...) 

(2) direct matrix description: 
    (matrix (a11 a12 ...) (a21 a22 ...) ...) 
    (diagonal (d1 d2 ...)) 
    (permutation (p1 p2 ...)) 

(3) parameterized matrices: 
    (I n) 
    (F n) 
    (T mn n) or (T mn n, start:step:end) 
    (L mn n) 

(4) parameterized scalars: 
    W(m n)     ; e^(2*pi*i*n/m) 
    WR(m n)    ; real part of W(m n) 
    WI(m n)    ; imaginary part of W(m n) 
    C(m n)     ; cos(pi*n/m) 
    S(m n)     ; sin(pi*n/m) 

(5) symbol definition: 
    (define name formula) 
    (template formula (i-code-list)) 

Templates are used to define new parameterized matrices and by the SPL 
compiler to determine the code for different formulas in SPL. Templates are 
defined using a language-independent syntax for code called i-code. 

3 SPL Compiler 

The execution of the SPL compiler consists of six stages: (1) parsing, (2) se- 
mantic binding, (3) type control, (4) optimization, (5) scheduling, and (6) code 
generation. The parser builds an abstract syntax tree (AST), which is then con- 
verted to intermediate code (i-code) using templates to define the semantics of 
different SPL expressions. The i-code is expanded to produce type dependent 
code (e.g. double precision real or complex) and loops are unrolled depending on 
compiler parameters. After intermediate code is generated, various optimizations 
such as constant folding, copy propagation, common sub-expression elimination, 
and algebraic simplification are performed. Optionally, data dependence analysis 
is used to rearrange the code to improve locality. Finally, the intermediate code 
is converted to FORTRAN, leaving machine dependent compilation stages (in 
particular register allocation and instruction scheduling) to a standard compiler. 
(The code generator could easily be modified to produce C code or even native 
assembly code directly instead of FORTRAN.) Different options to the compiler, 
command-line flags or compiler directives, control various aspects of the 
compiler, such as the data types, whether loops are unrolled, and whether the op- 
tional scheduler is used. Some compiler optimizations and instruction scheduling 
can also be obtained at the SPL level, by transforming the input formulas. 

The nodes in the AST corresponding to an SPL formula can be interpreted as 
code segments for computing a specific matrix-vector product. Code is generated 
by combining and manipulating these code segments, to obtain a program for the 
matrix-vector product specified by the SPL program. For example, a composition 
$A \cdot B$ is compiled into the sequence $t = B \cdot x$; $y = A \cdot t$, mapping input vector $x$ 
into output vector $y$ using the intermediate vector $t$. In the same way, the direct 
sum compiles into operations acting on parts of the input signal in parallel. The 
tensor product of code sequences for computing $A$ and $B$ can be obtained using 
the equation $A \otimes B = L^{mn}_m (I_n \otimes A) L^{mn}_n (I_m \otimes B)$. 
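
The interpretation of formulas as code segments can be sketched with a small evaluator. The Python code below is an illustrative interpreter, not the SPL compiler: formulas are represented as nested tuples that mimic the SPL prefix notation, the name `apply_formula` is hypothetical, the DFT uses the assumed convention $\omega_n = e^{2\pi i/n}$, and the tensor product is evaluated directly as $(A \otimes I_n)(I_m \otimes B)$ with strided accesses instead of explicit permutations.

```python
import cmath


def size(f):
    """Dimension of the square matrix denoted by formula f."""
    op = f[0]
    if op in ("F", "I", "T", "L"):
        return f[1]
    if op == "compose":
        return size(f[1])
    if op == "tensor":
        total = 1
        for g in f[1:]:
            total *= size(g)
        return total
    raise ValueError("unknown operator: %r" % (op,))


def apply_formula(f, x):
    """Return the product of the matrix denoted by f with the vector x."""
    op = f[0]
    if op == "I":
        return list(x)
    if op == "F":                                  # direct O(n^2) DFT
        n = f[1]
        w = cmath.exp(2j * cmath.pi / n)
        return [sum(w ** (l * k) * x[k] for k in range(n)) for l in range(n)]
    if op == "T":                                  # (T mn n): twiddle diagonal
        mn, n = f[1], f[2]
        w = cmath.exp(2j * cmath.pi / mn)
        return [w ** ((p // n) * (p % n)) * x[p] for p in range(mn)]
    if op == "L":                                  # (L mn n): stride permutation
        mn, n = f[1], f[2]
        m = mn // n
        y = [0] * mn
        for i in range(m):
            for j in range(n):
                y[j * m + i] = x[i * n + j]
        return y
    if op == "compose":                            # t = B x; y = A t
        t = list(x)
        for g in reversed(f[1:]):
            t = apply_formula(g, t)
        return t
    if op == "tensor":
        a = f[1]
        b = f[2] if len(f) == 3 else ("tensor",) + tuple(f[2:])
        m, n = size(a), size(b)
        t = []
        for i in range(m):                         # I_m (x) B: loop over segments
            t.extend(apply_formula(b, x[i * n:(i + 1) * n]))
        y = [0] * (m * n)
        for j in range(n):                         # A (x) I_n: loop at stride n
            column = apply_formula(a, t[j::n])
            for i in range(m):
                y[i * n + j] = column[i]
        return y
    raise ValueError("unknown operator: %r" % (op,))


F4 = ("compose", ("tensor", ("F", 2), ("I", 2)), ("T", 4, 2),
      ("tensor", ("I", 2), ("F", 2)), ("L", 4, 2))
F8 = ("compose", ("tensor", ("F", 2), ("I", 4)), ("T", 8, 4),
      ("tensor", ("I", 2), F4), ("L", 8, 2))

if __name__ == "__main__":
    x = [1, 2 - 1j, 0.5j, -3]
    print(apply_formula(F4, x))
    print(apply_formula(("F", 4), x))
```

A compiler would emit the loops and assignments instead of executing them, and would then unroll and simplify the result; the evaluator only shows how each operator determines the structure of the generated computation.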

For example, the code produced for the SPL program corresponding to the 
4-point Cooley/Tukey algorithm is 

      subroutine F4(y,x) 
      implicit complex*16 (f) 
      implicit integer (r) 
      complex*16 y(4),x(4) 
      f0 = x(1) + x(3) 
      f1 = x(1) - x(3) 
      f2 = x(2) + x(4) 
      f3 = x(2) - x(4) 
      f4 = (0d0,-1d0)*f3 
      y(1) = f0 + f2 
      y(3) = f0 - f2 
      y(2) = f1 + f4 
      y(4) = f1 - f4 
      end 





In this example, two compiler directives were added: one giving the name F4 
to the subroutine and one causing complex arithmetic to be used. Looking at this 
example, one already sees several optimizations that the compiler makes (e.g. 
multiplications by 1 and -1 are removed). More significantly, multiplication by 
the permutation matrix is performed as re-addressing in the array accesses. 
Another important point is that scalar variables were used for temporaries rather 
than array elements. This has significant consequences on the FORTRAN com- 
piler’s effectiveness at register allocation and instruction scheduling. 
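
As a sanity check, the straight-line dataflow of the listing above can be transcribed into Python and compared against the DFT definition. This is only an illustrative sketch, not output of the SPL compiler: the names `f4` and `dft` are ours, indices are 0-based, and the sign of the exponent in `dft` is an assumption chosen to match the factor (0d0,-1d0) in the generated code.

```python
import cmath


def f4(x):
    """Straight-line 4-point DFT mirroring subroutine F4 (0-based indices)."""
    f0 = x[0] + x[2]
    f1 = x[0] - x[2]
    f2 = x[1] + x[3]
    f3 = x[1] - x[3]
    f4_ = -1j * f3  # corresponds to (0d0,-1d0)*f3
    # Output order y(1), y(2), y(3), y(4) of the FORTRAN listing.
    return [f0 + f2, f1 + f4_, f0 - f2, f1 - f4_]


def dft(x):
    """Reference DFT by definition, using exp(-2*pi*i*j*k/n)."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


x = [1 + 2j, -3 + 0.5j, 0.25 - 1j, 4j]
max_error = max(abs(a - b) for a, b in zip(f4(x), dft(x)))
```

The only multiplication left is the one by -i; the permutation and the factors 1 and -1 appear purely as index choices and additions or subtractions.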

Changing the code type to real, #codetype real, breaks up complex num- 
bers into real and imaginary parts which gives the chance for further optimiza- 
tions. In the case above the (complex) multiplication vanishes. 



      subroutine F4(y,x)
      implicit real*8(r)
      real*8 y(8),x(8)
      r0 = x(1) + x(5)
      r1 = x(2) + x(6)
      r2 = x(1) - x(5)
      r3 = x(2) - x(6)
      r4 = x(3) + x(7)
      r5 = x(4) + x(8)
      r6 = x(3) - x(7)
      r7 = x(4) - x(8)
      y(1) = r0 + r4
      y(2) = r1 + r5
      y(5) = r0 - r4
      y(6) = r1 - r5
      y(3) = r2 + r7
      y(4) = r3 - r6
      y(7) = r2 - r7
      y(8) = r3 + r6
      end







In the previous example, we produced straight-line code. The SPL compiler is 
also capable of producing code with loops. For example, the formula 
$I_n \otimes A$ has a straightforward interpretation as a loop with n iterations, 
where each iteration applies A to a segment of the input vector. The SPL compiler 
is instructed to generate code using this interpretation by the following template, 
in which ANY matches any integer and any matches any SPL expression. 

; (tensor Im An) parameters:
; p.0: mn, p.1: size(I)=m, p.2: ANY=m, p.3: size(A)=n, p.4: A
(template (tensor (I ANY) any)
  (r.0 = p.3-1
   do p.1
     y(0:1:r.0 p.3) = call p.4(x(0:1:r.0 p.3))
   end))



The following example shows how to use the SPL compiler to combine straight-line 
code with loops using formula manipulation and loop unrolling (loop unrolling 
is controlled by the compiler directive #unroll or by specifying a command 
line option -Bn, which indicates that code for matrices of dimension n 
or less is to be unrolled). By a simple property of the tensor product, 
$I_{64} \otimes F_2 = I_{32} \otimes (I_2 \otimes F_2)$. With the option 
-R -1-B4, the SPL compiler produces straight-line code for the loop body, which 
computes $(I_2 \otimes F_2)$. 



      subroutine I64F2(y,x)
      implicit real*8(f)
      implicit integer(r)
      real*8 y(128),x(128)
      do i0 = 0, 31
        y(4*i0+1) = x(4*i0+1) + x(4*i0+2)
        y(4*i0+2) = x(4*i0+1) - x(4*i0+2)
        y(4*i0+3) = x(4*i0+3) + x(4*i0+4)
        y(4*i0+4) = x(4*i0+3) - x(4*i0+4)
      end do
      end
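
The loop interpretation of a tensor product with an identity matrix can be sketched in a few lines of Python. The helper names (`tensor_identity`, `f2`, `i2_f2`) are ours and purely illustrative; the sketch only checks that looping 64 times over F_2 and looping 32 times over an unrolled body for I_2 (x) F_2 give the same result.

```python
def f2(seg):
    """2-point butterfly F_2 applied to a length-2 segment."""
    a, b = seg
    return [a + b, a - b]


def i2_f2(seg):
    """Unrolled loop body computing I_2 (x) F_2 on a length-4 segment."""
    return [seg[0] + seg[1], seg[0] - seg[1], seg[2] + seg[3], seg[2] - seg[3]]


def tensor_identity(n, apply_a, size_a, x):
    """Apply I_n (x) A as a loop: iteration i applies A to the i-th segment."""
    y = []
    for i in range(n):
        y.extend(apply_a(x[i * size_a:(i + 1) * size_a]))
    return y


x = [float(k) for k in range(128)]
direct = tensor_identity(64, f2, 2, x)       # I_64 (x) F_2
blocked = tensor_identity(32, i2_f2, 4, x)   # I_32 (x) (I_2 (x) F_2)
```

Both calls compute the same vector; the second has half as many loop iterations, each with a four-statement straight-line body, as in the FORTRAN listing.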



SPL clearly provides a convenient way of expressing and implementing matrix 
factorizations; however, it can only be considered a serious programming 
tool if the generated code is competitive with the best code available. One 
strength of the SPL compiler is its ability to produce long sequences of 
straight-line code. In order to obtain maximal efficiency, small signal transforms 
should be implemented with straight-line code, thus avoiding the overhead of loop 
control or recursion. Table 1 compares code for small FFTs generated by the SPL 
compiler to the FFT codelets of FFTW on a Sun Ultra 5, 333 MHz. 

n           1     2     3     4     5     6
FFTW        0.04  0.07  0.23  0.53  1.50  4.34
SPL FFT     0.03  0.05  0.22  0.50  1.45  4.29

Table 1. Runtime of FFT(2^n) in μs, FFTW vs. SPL 



4 SPL Formula Generator 

SPL programs can be derived by applying various mathematical theorems such 
as the Cooley-Tukey theorem. Such mathematical theorems can be thought of 
as SPL program transformations. For a given matrix, it is possible to generate 
many different SPL programs for performing the same mathematical computation. 
While mathematically equivalent, different SPL programs may, when 
translated by the SPL compiler, lead to FORTRAN programs with vastly different 
performance. Optimizing a given linear computation can therefore be thought of 
as a search problem: generate all possible SPL programs, compile them, and 
select the one which leads to the fastest execution time. By “all” possible 
algorithms, we mean all of those SPL programs that can be generated by applying a 
fixed set of mathematical rewrite rules. 

In this section we consider a generalization of the Cooley-Tukey theorem 
which allows us to generate many different FFT algorithms. This space of FFT 
algorithms allows us to explore various combinations of iterative and recursive 
FFT algorithms. 

The following theorem is an easy corollary of the Cooley-Tukey theorem. 

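Theorem 3. Let $N = 2^n$ and let $N = N_1 \cdots N_t$, with $N_i = 2^{n_i}$, be a 
factorization of $N$. Using the notation $N(i) = N_1 \cdots N_i$, 
$\bar{N}(i) = N/N(i)$, $N(0) = 1$, and $N(t) = N$, we have 

$$F_N = \left\{ \prod_{i=1}^{t} \left(I_{N(i-1)} \otimes F_{N_i} \otimes I_{\bar{N}(i)}\right) \left(I_{N(i-1)} \otimes T^{N_i \bar{N}(i)}_{\bar{N}(i)}\right) \right\} R \qquad (13)$$

where 

$$R = \prod_{i=t}^{1} \left(I_{N(i-1)} \otimes L^{N_i \bar{N}(i)}_{N_i}\right) \qquad (14)$$

is a generalization of the bit-reversal permutation. 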

Applying this factorization recursively to each $F_{N_i}$ until the base case 
$F_2$ is reached leads to an algorithm for computing $F_N$. The special case when 
$t = 2$, $N_1 = 2$, and $N_2 = N/2$ leads to the standard recursive FFT. Other 
recursive breakdown strategies are obtained by choosing different values for 
$N_1$ and $N_2$. The special case when $N_1 = \cdots = N_t = 2$ leads to the 
standard iterative, radix-two FFT. The space of formulas we consider allows us to 
explore various amounts of recursion and iteration along with different breakdown 
strategies. 

Corresponding to this factorization is a partition of the integer 
$n = n_1 + \cdots + n_t$. A similar partition is obtained for each $n_i$, and the 
resulting algorithm can be represented as a tree, whose nodes are labeled by the 
integers $n_i$. The tree, called a partition tree, has the property that the 
value at any node is equal to 
the sum of the values of its children. Partition trees were introduced in 0. 
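
To make the link between a partition tree and an FFT algorithm concrete, here is a minimal Python sketch that evaluates the transform by recursing over a tree. It is an illustration under several assumptions of ours, not the SPL compiler's code generator: trees are nested pairs (only binary splits, i.e. t = 2 at every node), a leaf is an integer exponent whose transform is computed directly from the definition, and the sign convention exp(-2*pi*i/N) is assumed. Each internal node applies the four Cooley-Tukey stages: stride permutation, I_r (x) F_s, twiddle factors, and F_r (x) I_s.

```python
import cmath


def dft(x):
    """DFT by definition; used for leaves and as the reference."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def exponent(tree):
    """Label of a node: a leaf's value, or the sum of the children's labels."""
    return tree if isinstance(tree, int) else sum(exponent(c) for c in tree)


def fft_tree(x, tree):
    """Compute F_N x, N = 2**exponent(tree), following a binary partition tree."""
    if isinstance(tree, int):
        return dft(x)
    left, right = tree
    r, s = 2 ** exponent(left), 2 ** exponent(right)
    n = r * s
    # Stride permutation: gather the r subsequences of stride r.
    z = [x[j2 * r + j1] for j1 in range(r) for j2 in range(s)]
    # I_r (x) F_s: transform each contiguous segment of length s.
    u = []
    for j1 in range(r):
        u.extend(fft_tree(z[j1 * s:(j1 + 1) * s], right))
    # Twiddle factors.
    u = [u[j1 * s + k2] * cmath.exp(-2j * cmath.pi * j1 * k2 / n)
         for j1 in range(r) for k2 in range(s)]
    # F_r (x) I_s: transform each subsequence of stride s.
    y = [0j] * n
    for k2 in range(s):
        y[k2::s] = fft_tree(u[k2::s], left)
    return y


x = [complex(k % 7, -(k % 3)) for k in range(32)]
tree = ((1, 1), (1, (1, 1)))          # 5 = 2 + 3 = (1 + 1) + (1 + (1 + 1))
error = max(abs(a - b) for a, b in zip(fft_tree(x, tree), dft(x)))
```

Different trees with the same root compute the same transform but traverse memory and nest loops differently, which is what makes the choice of tree a performance question.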

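The number of trees, $T_n$, with root equal to $n$ is given by the recurrence 

$$T_n = \begin{cases} 1, & n = 1 \\ \displaystyle\sum_{n = n_1 + \cdots + n_t,\ t > 1} T_{n_1} \cdots T_{n_t}, & n > 1 \end{cases} \qquad (15)$$

The solution to this recurrence is $\Theta(\alpha^n / n^{3/2})$, where 
$\alpha = 3 + 2\sqrt{2} \approx 5.83$. Table 2 lists the initial values for $T_n$. 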



 n   # of formulas    n   # of formulas    n   # of formulas    n   # of formulas
 1               1    5              45    9           20793   13        13648869
 2               1    6             197   10          103049   14        71039373
 3               3    7             903   11          518859   15       372693519
 4              11    8            4279   12         2646723   16      1968801519

Table 2. Size of the formula space 
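
The recurrence can be evaluated directly. The following Python sketch (function and variable names are ours) uses an auxiliary count S[m], the number of ordered sequences of one or more trees whose root values sum to m. Choosing the first child k and then any non-empty sequence for the remaining n - k gives T_n as the sum of T_k * S[n-k] over k < n, and for n > 1 a sequence either consists of the single tree n or has at least two parts, so S[n] = 2 T_n.

```python
def formula_counts(max_n):
    """Return [T_1, ..., T_max_n] for the partition-tree recurrence."""
    T = [0] * (max_n + 1)
    S = [0] * (max_n + 1)  # S[m]: sequences of >= 1 trees with labels summing to m
    T[1] = S[1] = 1
    for n in range(2, max_n + 1):
        # First child has label k, the remaining children form a sequence for n - k.
        T[n] = sum(T[k] * S[n - k] for k in range(1, n))
        S[n] = 2 * T[n]
    return T[1:]


counts = formula_counts(16)
```

For n = 9 this yields 20793, the number of formulas searched exhaustively in the experiments.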



Figure 1 shows the distribution of runtime (in microseconds) for all 20793 
formulas for $F_{2^9}$. All runtimes presented in this paper were obtained on an 
Ultra 5 workstation, with a 333MHz UltraIIi CPU and 128MB memory. The software 
environment was Solaris 7, SUN Fortran 77 4.0, and SPL Compiler 3.04. The 
left side of Figure 1 plots performance in microseconds versus formula and the 
right side shows a histogram of the distribution of runtimes. Each bin shows the 
number of formulas with runtimes in a given range. The fastest formula is about 
five times faster than the slowest one and 2.25 times faster than the mean. Also 
observe that only 1.7% of the total formulas, approximately 500 formulas, fall 
into the bin with the highest performance. 

These plots show that there is a wide range in runtime, and that a significant 
gain in performance can be obtained by searching for the formula with the least 
execution time. The only difficulty in this approach to optimization is that the 
number of formulas considered grows exponentially, and hence an exhaustive 
search is not feasible for large values of n. In the next section we will present 
some techniques for reducing the search time. 



5 Search 

In this section we search for the optimal FFT implementation using the SPL 
compiler and the FFT formulas generated in the previous section. Several search 
strategies are applied in an effort to reduce the number of formulas considered 
while still obtaining a nearly optimal implementation. In addition to searching 
over the space of FFT formulas presented, we also search over several compiler 
strategies related to loop unrolling. 

Fig. 1. Execution time (μs) and distribution, n=9, Exhaustive 

For each formula, there is a tradeoff between generating straight-line code 
and loop code. Straight-line code involves fewer instructions than loop code but 
its size increases with N. This leads to performance degradation due to cache 
and memory limitations. Except for small sized problems, partial unrolling is 
probably the best choice. The SPL compiler can generate several versions of 
code with different degrees of loop unrolling. The controlling parameter is called 
the unrolling threshold, which is denoted as $T_u$ in this paper. Given a formula, 
the code generation is carried out top-down over the tree. If the size of a node is 
less than or equal to $T_u$, then the entire subtree rooted at this node is 
implemented by straight-line code; otherwise, the code corresponding to this node 
will be implemented as a loop. 
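
The top-down rule can be sketched as a small recursive function. This is an illustration only: the nested-tuple tree representation, the names `label` and `codegen_plan`, and the reading of "size of a node" as the node's label in the partition tree are assumptions of ours, not details taken from the SPL compiler.

```python
def label(tree):
    """Label of a node: a leaf's value, or the sum of the children's labels."""
    return tree if isinstance(tree, int) else sum(label(c) for c in tree)


def codegen_plan(tree, threshold):
    """Decide, top-down, which subtrees become straight-line code."""
    size = label(tree)
    if size <= threshold:
        # The whole subtree is unrolled; its children are not inspected.
        return ("straight-line", size)
    children = [] if isinstance(tree, int) else [
        codegen_plan(child, threshold) for child in tree
    ]
    return ("loop", size, children)


tree = ((1, 1), ((1, 1), (1, (1, 1))))   # root label 7
plan = codegen_plan(tree, 2)
```

Raising the threshold moves the boundary between loop code and straight-line code toward the root; with a threshold of 7 the whole example tree is unrolled.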

It is impractical to try all the possible values of $T_u$. Preliminary experiments 
show that both compile time and execution time increase dramatically when $T_u$ 
is greater than 7 on the Ultra 5 workstation. This value could be different on 
other machines or for other problems, but we believe that the upper bound of the 
value of interest can always be identified and is much smaller than N when N is 
large enough. Therefore, in our experiment, only values of $T_u \le 7$ are 
considered. The entire search space is the Cartesian product of the formula space 
and the set of values of $T_u$. 

Since exhaustive search is impractical it is necessary to use a search strategy 
that enables the optimal formula, or a nearly optimal formula, to be found 
without considering such a large space of formulas. As a first step one might 
randomly choose a subset of formulas to consider. However, Fig. 1 suggests that 
this approach is unlikely to find an optimal formula. 

Instead we experimented with several versions of heuristic search based on 
local transformations of formulas. The first approach, called single point search 
(SPS), starts with a random formula and repeatedly finds a neighboring formula 
while checking to see if the neighbor leads to better performance. The second 
approach, called multiple point search (MPS), starts with a small set of formulas 
(in our case 4) and searches locally in a neighborhood of each formula in 
the initial set. 

Both approaches rely on a function nexttree to find a “neighbor” of an existing 
formula. The rules for finding a “neighbor” in SPS are shown in Fig. 2. The rules 
used by MPS are the same, except that R7 (Random) is not used. The “neighbors” 
are similar in their tree structure and, as a result, their performance values tend 
to be close (except when R7 is applied). 



R1 (Merge):       merge two subtrees;
R2 (Split):       split one node into two;
R3 (Resize):      resize two subtrees by one;
R4 (Swap):        swap two subtrees;
R5 (Unrollmore):  increase the unrolling threshold by one;
R6 (Unrollless):  decrease the unrolling threshold by one;
R7 (Random):      randomly choose another tree;
R8 (Chgchild):    apply these rules to a subtree;

Fig. 2. Rules for finding a neighbor 
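
The flavor of single point search can be conveyed with a toy model. The sketch below is not the paper's nexttree: it searches only over top-level compositions of n (flat trees), implements simplified versions of Merge, Split, Resize, and Swap on adjacent parts, accepts a neighbor only when it is strictly better, and replaces measured runtime with a made-up cost function. All names and the cost function are hypothetical.

```python
import random


def next_composition(parts, rng):
    """Return a neighbor of a composition using one randomly chosen rule."""
    parts = list(parts)
    rule = rng.choice(["merge", "split", "resize", "swap"])
    if rule == "split":
        i = rng.randrange(len(parts))
        if parts[i] > 1:
            a = rng.randrange(1, parts[i])
            parts[i:i + 1] = [a, parts[i] - a]
    elif len(parts) > 1:
        i = rng.randrange(len(parts) - 1)
        if rule == "merge":
            parts[i:i + 2] = [parts[i] + parts[i + 1]]
        elif rule == "resize" and parts[i] > 1:
            parts[i] -= 1
            parts[i + 1] += 1
        elif rule == "swap":
            parts[i], parts[i + 1] = parts[i + 1], parts[i]
    return tuple(parts)  # unchanged if the chosen rule did not apply


def toy_cost(parts):
    """Hypothetical stand-in for a measured runtime; NOT data from the paper."""
    return sum((p - 3) ** 2 for p in parts) + len(parts)


def single_point_search(start, neighbor, cost, steps, rng):
    """Move to a neighbor whenever it is cheaper; return the best point seen."""
    current, current_cost = start, cost(start)
    for _ in range(steps):
        candidate = neighbor(current, rng)
        candidate_cost = cost(candidate)
        if candidate_cost < current_cost:
            current, current_cost = candidate, candidate_cost
    return current, current_cost


best, best_cost = single_point_search(
    (9,), next_composition, toy_cost, 200, random.Random(0)
)
```

In the actual experiments the cost of a candidate is obtained by compiling and timing the corresponding formula, so each step is far more expensive than in this toy setting.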



The two approaches were compared to random search and exhaustive search 
when it was feasible. Data is presented for experiments with n = 9 and n = 16. 
The results for other sizes have similar characteristics. 

The first observation is that different search strategies behave differently. 
Figures 3 to 5 show the execution time plotted against formulas considered and 
the distribution of runtimes (number of formulas with a given runtime) for 
random search, SPS, and MPS when n = 9. For random search, the execution time 
graph shows a random pattern and the distribution looks like a normal 
distribution. For MPS, the execution time converges to a small range near the best 
and the distribution concentrates on the fast side. This shows that the definition 
of “neighbor” in MPS is effective in controlling the range of the search. For SPS, 
the run-time also shows a somewhat random pattern. However, the distribution 
shows that the run-times are more concentrated on the fast side. The 
random pattern is due to the use of R7 (Random) in finding neighbors. 

The second important observation is that a nearly optimal formula can be 
found while considering substantially fewer formulas than the total size of the 
search space. Recall that the size of the search space is $O(5.83^n)$. The number 
of samples used in SPS and MPS was $O(n^2)$. Figure 6 shows at each step the 
best performance obtained since the beginning of the test. This illustrates how 
quickly the search methods converge on a formula with good performance. 

Fig. 3. Execution time (μs) and distribution, n=9, Random 

Fig. 6. Best execution time since the beginning (μs), n=9 (left), 16 (right) 



           SPS                 MPS                Random             Exhaustive
 n    steps  time(μs)    steps  time(μs)    steps  time(μs)    steps  time(μs)
 2        1      0.08        1      0.08        1      0.08        1      0.08
 3        1      0.22        7      0.22        4      0.22        1      0.22
 4       10      0.51       25      0.51       32      0.51        4      0.51
 5       31      1.63       50      1.63      103      1.67       12      1.64
 6       34      4.75      112      4.76        6      5.11       68      4.65
 7       25     14.63      128     14.62      216     14.80      364     14.57
 8       59     40.65      261     40.57       11     66.93     1604     39.43
 9      142    107.30      533    109.76      158    113.79     8459    107.46
10      163    264.46      585    265.15      181    299.76
11      286    604.35      380    586.11       96    633.96
12      292   1443.50      453   1601.88      269   1602.99
13      443   3007.30      599   3222.55      795   3418.46
14      530   8567.36      455   8411.40     1038   9758.44
15      625  28892.56      272  30054.08      496  31583.56
16     1536  79886.40     1420  75601.14     1361  84241.70

Table 3. The best formulas found 



Table 3 shows the performance of the best formula found by each algorithm 
and the number of steps when it was found. The best formulas found by SPS 
and MPS are very close to the best one found in the exhaustive search (we only 
verified this for n ≤ 9). For the random search, although a reasonably good 
formula can be found quickly, the performance of the best formula found is not as 
good as those found by SPS and MPS. The difference gets larger as n increases. 

These experiments show that search techniques can be used to obtain fast 
implementations of signal processing algorithms like the FFT. The mathematical 
approach presented in the paper sets the framework for a systematic search, and 
while an exhaustive search is infeasible, simple heuristic approaches allow nearly 
optimal implementations to be found in a reasonable amount of time. 



Abstract. We present results demonstrating the usefulness of monolithic 
program analysis and optimization prior to scalarization. In particular, 
models are developed for studying nonmaterialization in basic 
blocks consisting of a sequence of assignment statements involving 
array-valued variables. We use these models to analyze the problem of 
minimizing the number of materializations in a basic block, and to develop 
an efficient algorithm for minimizing the number of materializations in 
certain cases. 



1 Introduction 

Here, we consider the analysis and optimization of code utilizing operations and 
functions operating on entire arrays or array sections (rather than on the under- 
lying scalar domains of the array elements). We call such operations monolithic 
array operations. 

In [?], Veldhuizen and Gannon refer to traditional approaches to optimization 
as transformational, and say that a difficulty with “transformational 
optimizations is that the optimizer lacks an understanding of the intent of the 
code. . . . More generally, the optimizer must infer the intent of the code to apply 
higher-level optimizations.” The dominant approach to compiler optimization of 
array languages (Fortran90, HPF, etc.) is transformational optimization done 
after scalarization. In contrast, our results suggest that monolithic style 
programming, and subsequent monolithic analysis prior to scalarization, can be used 
to perform radical transformations based upon intensional analysis. Determining 
intent subsequent to scalarization seems difficult, if not impossible, because 
much global information is obfuscated. Finally, many people doing scientific 
programming find it easier and more natural to program using high level monolithic 
array operations. This suggests that analysis and optimization at the monolithic 
level will be of growing importance in the future. 

In this paper, we study materializations of array-valued temporaries, with a 
focus on elimination of array- valued temporaries in basic blocks. 

S.P. Midkiff et al. (Eds.): LCPC 2000, LNCS 2017, pp. 127- rTTn 2001. 

(c) Springer-Verlag Berlin Heidelberg 2001 

There has been extensive research on nonmaterialization of array-valued 
intermediate results in evaluating array-valued expressions. The optimization of 
array expressions in APL entails nonmaterialization of array subexpressions 
involving composition of indexing operations [?]. Nonmaterialization of 
array-valued intermediate results in evaluating array-valued expressions is done 
in Fortran90, HPF, ZPL [?], POOMA [?], C++ templates, the Matrix Template 
Library (MTL) [?], active libraries, etc. Mullin’s Psi Calculus 
model [?] provides a uniform framework for eliminating materializations 
involving array addressing, decomposition, and reshaping. Nonmaterialization in 
basic blocks is more complicated than in expressions because a basic block may 
contain assignment statements that modify arrays that are involved in potential 
nonmaterializations. Some initial results on nonmaterialization applied to basic 
blocks appear in Roth et al. [?]. 

In Section 2 we develop a framework, techniques, algorithms, etc., for the 
study of nonmaterialization in basic blocks. We give examples of how code can 
be optimized via nonmaterialization in basic blocks, and then formalize this op- 
timization problem. We then formalize graph-theoretic models to capture fun- 
damental concepts of the relationship between array values in a basic block. 
These models are used to study nonmaterializations, and to develop techniques 
for minimizing the number of materializations. A lower bound is given on the 
number of materializations required. An optimization algorithm for certain cases 
is given. This algorithm is optimal; the number of materializations produced ex- 
actly matches the lower bound. 

In Section 3 we give brief conclusions. 



2 Nonmaterialization in Basic Blocks 

2.1 Examples of Nonmaterialization 

Many array operations, represented in HPF and Fortran90 as sectioning and as 
intrinsics such as transpose, cshift, eoshift, reshape, and spread, involve the 
rearrangement and replication of array elements. We refer to these operations as 
address shuffling operations. These operations essentially utilize array indexing, 
and are independent of the domain of values of the scalars involved. Often it 
is unnecessary to generate code for these operations. Instead of materializing 
the result of such an operation (i.e. constructing the resulting array at run- 
time), the compiler can keep track of how elements of the resulting array can 
be obtained by appropriately addressing the operands of the address shuffling 
operation. Subsequent references to the result of the operation can be replaced 
by suitably modified references to the operands of the operation. 

As an example of nonmaterialization, consider the following statement. 

X = B + transpose(C) 

Straightforward code for this example would first materialize the result of 
the transpose operation in a temporary array, say named Y, and then add 
this temporary array to B, assigning the result to X. Indeed, instead of using 
a single assignment statement, the programmer might well have expressed the 
above statement as such a sequence of two statements, forming part of a basic 
block, as follows. 

Y = transpose(C) 

X = B + Y 

A straightforward compiler might produce a separate loop for each of the 
above statements, but an optimizer could fuse the two loops, producing the 
following single loop. 

forall (i = 1:N, j = 1:N) 
  Y(i,j) = C(j,i) 
  X(i,j) = B(i,j) + Y(i,j) 
end forall 

However, Y need not be materialized at all, yielding the following code. 

forall (i = 1:N, j = 1:N) 
  X(i,j) = B(i,j) + C(j,i) 
end forall 
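
The equivalence of the materialized and nonmaterialized versions is easy to check on a small case. The following Python sketch uses nested lists in place of Fortran arrays; the function names are ours and the code only mirrors the two loop nests above.

```python
def transpose(c):
    """Materialize the transpose of a square matrix as a new array."""
    return [list(row) for row in zip(*c)]


def materialized(b, c):
    """X = B + Y with Y = transpose(C) stored in a temporary array."""
    y = transpose(c)
    n = len(b)
    return [[b[i][j] + y[i][j] for j in range(n)] for i in range(n)]


def nonmaterialized(b, c):
    """X = B + transpose(C) with the transpose folded into the indexing."""
    n = len(b)
    return [[b[i][j] + c[j][i] for j in range(n)] for i in range(n)]


B = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
C = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
same = materialized(B, C) == nonmaterialized(B, C)
```

The second version never allocates the temporary; the transpose survives only as the swapped subscripts `c[j][i]`.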

Nonmaterialization is also an issue in optimizing distributed computation. 
Kennedy et al. [?], Roth [?], and Roth et al. [?] developed methods for 
optimizing stencil computations by nonmaterialization of selected cshift operations 
involving distributed arrays. This nonmaterialization was aimed at minimizing 
communications, including both the amount of data transmitted and the number 
of messages used. The amounts of the shifts in the stencils involved were 
constants, so the compiler could determine which cshift operations in the basic 
block were potential beneficiaries of nonmaterialization. The compiler could 
analyze the basic block and choose not to materialize some of these cshift 
operations. The nonmaterialization technique in [?] used a sophisticated form 
of replication, where a subarray on a given processor was enlarged by adding 
extra rows and columns that were replicas of data on other processors. The actual 
computation of the stencil on each processor referred to the enlarged array on 
that processor. 

For instance, consider the following example, taken from Roth et al. [?]. 

(1) RIP = cshift(U,shift=+1,dim=1) 

(2) RIN = cshift(U,shift=-1,dim=1) 

(3) T = U + RIP + RIN 

The optimized version of this code is the following. The cshift operations 
in statements (1) and (2) are replaced by overlap_cshift operations, which 
transmit enough data from U between processors so as to fill in the overlap 
areas on each processor. In statement (3), the references to RIP and RIN are 
replaced by references to U, annotated with appropriate shift values, expressed 
using superscripts. 

(1) call overlap_cshift(U,shift=+1,dim=1) 

(2) call overlap_cshift(U,shift=-1,dim=1) 

(3) T = U + U<+1> + U<-1> 



2.2 Formulation of Nonmaterialization Optimization Problem for Basic Blocks 

We now begin developing a framework for considering the problem of minimizing 
the number of materializations in a basic block (equivalently, maximizing the 
number of shuffle operations that are not materialized) . 

Definition 1. A def-value in a basic block is either the initial value (at the 
beginning of the basic block) of an array variable or an occurrence of an array 
variable that is the destination of an assignment. (An assignment can be to the 
complete array, to a section of the array, or to a single element of the array.) 

Definition 2. A complete-def is a definition to an array variable in which 
an assignment is made to the complete array. A partial-def is a definition to 
an array variable in which an assignment is made to only some of the array 
elements, and the other array elements retain their prior values. An initial-def 
is an initial value of an array variable. 

Note that each def-value can be classified as either a complete-def, a partial- 
def, or an initial-def. For instance, consider the following example. 

Example 1. 

(1) B = cshift(A,shift=+5,dim=1) 

(2) C = cshift(A,shift=+3,dim=1) 

(3) D = cshift(A,shift=-2,dim=1) 

(4) B(i) = 50 

(5) R(2:199) = B(2:199) + C(2:199) + D(2:199) 

A, B, C, and D all dead at end of basic block, R live. 

For convenience, we identify each def-value by the name of its variable, 
superscripted with either 0 for an initial-def, or the number of the statement 
assigning the def-value for a noninitial-def. Example 1 involves initial-defs 
$A^0$ and $R^0$, complete-defs $B^1$, $C^2$, and $D^3$, and partial-defs $B^4$ 
and $R^5$. 

An important issue for nonmaterialization is determining whether a given 
shuffle operation is a candidate for nonmaterialization. The criterion for a given 
shuffle operation being a candidate is that it be both safe and profitable. The 
criterion for being safe is that the source array of the shuffle is not modified 
while the definition of the destination array is live, and that the destination 
array is not partially modified while the source array is live. The criteria for 
profitability can depend on the optimization goal, the shuffle operation involved, 
the shape of the arrays involved, architectural features of the computing 
environment on which the computation will be performed, and the 
distribution/alignment of the arrays involved. For instance, Roth [?] gives 
criteria for stencil computations on distributed machines. For purposes of this 
paper, we assume the availability of appropriate criteria for determining whether 
a given shuffle operation is a candidate for nonmaterialization. 

Definition 3. An eligible statement is an assignment statement whose right 
side consists of a shuffle operation meeting the criteria for nonmaterialization. 
We say that the def-value occurring on the right side of an eligible statement is 
mergeable with the def-value occurring on the left side of the statement. An 
ineligible statement is a statement that is not an eligible statement. 

In a given basic block, the decisions as to which eligible statements 
should be nonmaterialized are interdependent. Roth [?] uses a “greedy” 
algorithm to choose which eligible statements to nonmaterialize. Namely, at 
each step, his algorithm chooses the first eligible statement, and modifies the 
basic block so that this statement is not materialized. Each such choice can 
cause subsequent previously eligible statements to become ineligible. Example 1 
illustrates this phenomenon. 

Consider the basic block shown in Example 1. Assume that statements (1), 
(2), and (3) are eligible statements. Roth’s method would first consider the shuffle 
operation in statement (1), and choose to merge B with A. This merger would be 
carried out as shown below, changing the partial-def of B in statement (4) into 
a partial-def of A. In statements (4) and (5), the reference to B is replaced by a 
reference to A, annotated with the appropriate shift value. 

(1) call overlap_cshift(A,shift=+5,dim=1) 

(2) C = cshift(A,shift=+3,dim=1) 

(3) D = cshift(A,shift=-2,dim=1) 

(4) A<+5>(i) = 50 

(5) R(2:199) = A<+5>(2:199) + C(2:199) + D(2:199) 

Note that the new partial-def of A in statement (4) makes statements (2) 
and (3) unsafe, and therefore ineligible, thereby preventing C and D from being 
merged with A. Thus there would be three copies of the array. It is better 
to make a separate copy of A in B, and let A, C, and D all share the same copy, 
as follows: 

(1) B = cshift(A,shift=+5,dim=1) 

(2) call overlap_cshift(A,shift=+3,dim=1) 

(3) call overlap_cshift(A,shift=-2,dim=1) 

(4) B(i) = 50 

(5) R(2:199) = B(2:199) + A<+3>(2:199) + A<-2>(2:199) 
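
The safety issue in Example 1 can be reproduced with a small single-processor model. In the Python sketch below, a nonmaterialized cshift is modeled as an index-offset view that writes through to its base array. The class `ShiftView`, the helper `run`, the array size 200, and the choice of the 0-based position 9 for the assignment B(i) = 50 are all assumptions made for illustration. Three variants are compared: every shift materialized (the reference semantics), B copied while C and D are views of A (the better choice above), and all three merged with A (which the safety criterion forbids).

```python
class ShiftView:
    """Nonmaterialized cshift: element i of the view is base[(i + shift) mod n]."""

    def __init__(self, base, shift):
        self.base, self.shift = base, shift

    def __getitem__(self, i):
        return self.base[(i + self.shift) % len(self.base)]

    def __setitem__(self, i, value):
        # A partial-def of the view becomes a partial-def of the base array.
        self.base[(i + self.shift) % len(self.base)] = value


def cshift(a, shift):
    """Materialized circular shift: result[i] = a[(i + shift) mod n]."""
    n = len(a)
    return [a[(i + shift) % n] for i in range(n)]


def run(make_b, make_c, make_d, n=200, i=9):
    """Execute statements (1)-(5) of Example 1 and return R(2:199)."""
    a = list(range(n))
    b, c, d = make_b(a), make_c(a), make_d(a)
    b[i] = 50
    return [b[k] + c[k] + d[k] for k in range(1, n - 1)]


def copy(s):
    return lambda a: cshift(a, s)


def view(s):
    return lambda a: ShiftView(a, s)


reference = run(copy(5), copy(3), copy(-2))
better = run(copy(5), view(3), view(-2))
all_merged = run(view(5), view(3), view(-2))
```

With B copied, the views C and D read an unmodified A and reproduce the reference result; with B merged as well, the write through B changes an element of A that C and D later read, so the result differs.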

The above example generalizes. Suppose that instead of just C and D, there 
were n additional variables which could be merged with A. Roth’s method would 
only merge B with A, resulting in a total of n + 1 materialized arrays. In contrast, 
by making a separate copy for B, there would be a total of only two materialized 
arrays. Consequently, the “greedy” algorithm can be arbitrarily worse than 
optimal. 

The optimization problem we consider is to minimize the number 
of materializations in a basic block, under the following seven assumptions: 

Assumptions: 

1. No dead code. (No Dead Assumption) 

2. Arrays with different names are not aliased¹. (No Aliasing Assumption) 

3. No rearrangement of the ineligible statements in a basic block. (Fixed Se- 
quence Assumption) 

4. The mergeability relation between def-values in a basic block is symmetric 
and transitive. (Full Mergeability Assumption) 

5. There is no fine-grained analysis of array indices. Recall that each def-value 
is classified as being either an initial-def, a complete-def or a partial-def. No 
analysis is made as to whether two partial-defs overlap or whether a given 
use overlaps a given partial-def. (Coarse Analysis Assumption) 

6. The initial value of each variable that is live on entry to a basic block is 
stored in the variable’s official location. (Live Entry Assumption) 

7. A variable that is live at exit from a basic block, must have its final value 
at the end of the basic block stored in the variable’s official location. (Live 
Exit Assumption) 

In considering Assumptions 6 and 7, note that any def-value that is not live 
at entry or exit from a basic block need not be stored in the official location of 
its variable. 



2.3 The Clone Forest Model 

We now consider sets of mutually mergeable def-values, represented as trees in 
the clone forest, defined as follows. 

Definition 4. The clone forest for a basic block is a graph with a node for 
each def-value occurring in the basic block, and a directed edge from the source 
def-value to the destination def-value of each eligible shuffle operation. A clone 
set is the set of def-values occurring in a tree of the clone forest. The root of 
a given clone set is the def-value that is the root of the clone tree corresponding 
to the clone set. 
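
Definition 4 can be illustrated on Example 1 with a few lines of Python. The representation is an assumption of ours: def-values are strings such as "A0" (variable name followed by the superscript), eligible shuffle operations are (source, destination) pairs, and each def-value is the destination of at most one eligible shuffle, so following parent links always reaches a root.

```python
def clone_sets(def_values, eligible_edges):
    """Group def-values into clone sets, keyed by the root of each clone tree."""
    parent = {d: None for d in def_values}
    for source, destination in eligible_edges:
        parent[destination] = source

    def root(d):
        while parent[d] is not None:
            d = parent[d]
        return d

    sets = {}
    for d in def_values:
        sets.setdefault(root(d), []).append(d)
    return sets


# Example 1: statements (1)-(3) are eligible shuffles with source A0.
defs = ["A0", "R0", "B1", "C2", "D3", "B4", "R5"]
edges = [("A0", "B1"), ("A0", "C2"), ("A0", "D3")]
forest = clone_sets(defs, edges)
```

For Example 1 this produces one four-element clone set rooted at A0 and three singleton clone sets.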

As an example, the clone forest for Example 1 is shown in Figure 1. 

Definition 5. A materialization is either an initial-def or a complete-def. 

An important observation is that at least one materialization is needed for 
each clone set whose root is either an initial-def or a complete-def. The overall 
optimization problem can be viewed as minimizing the total number of 
materializations for all the clone sets for a given basic block. In Example 1, the 
only clone set with more than one member is $\{A^0, B^1, C^2, D^3\}$. The basic 
block can be optimized by materializing $A^0$ and $B^1$ from this clone set, and 
materializing $R^0$ from clone set $\{R^0\}$. Clone sets $\{B^4\}$ and $\{R^5\}$ 
do not require new materializations; they can be obtained by partial-defs to 
already materialized arrays. 

Fig. 1. Clone forest for Example 1 

¹ However, the techniques in this paper can be suitably modified to use aliasing 
information. 

Note that even if there is a materialization for a given clone set, the materi- 
alized member of the clone set need not necessarily be the root of the clone set. 
This freedom may be necessary for optimization, as illustrated in the following 
example. 

Example 2. 

(1) A = C + D 

(2) B = cshift(A,shift=+3,dim=1) 

(3) z = A(i) + B(i) 

A dead, B live at end of basic block. 

Consider the clone set consisting of A^ and B^. The root, A^, is a complete- 
def, and so at least one materialization is needed for this clone set. Since A is 
dead at the end of the basic block, and B is live, it is advantageous to materialize 
the value of the clone set in variable B instead of variable A. Thus, an optimized 
version of the basic block might be as follows. Here, the partial result C + D is 
annotated by the compiler to indicate that the value stored in variable B should 
be appropriately shifted. 

(1') B = (C + D)<+3>

(3) z = B<-3>(i) + B(i) 
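The shift annotations in the optimized block can be checked with a small Python sketch. It assumes the Fortran CSHIFT convention that element i of the result is element i+shift of the argument (circularly), uses 0-based indexing, and introduces the helper names cshift and shifted_ref purely for illustration. It verifies value equivalence only and does not model the storage that the optimization saves.

```python
def cshift(a, shift):
    """Circular shift with Fortran CSHIFT semantics: result[i] == a[(i + shift) % n]."""
    n = len(a)
    return [a[(i + shift) % n] for i in range(n)]


def shifted_ref(b, offset, i):
    """Read element i of b through a shift annotation <offset>, i.e. b[(i + offset) % n]."""
    return b[(i + offset) % len(b)]


C = list(range(10))
D = [10 * x for x in range(10)]
i = 4

# Original basic block of Example 2.
A = [c + d for c, d in zip(C, D)]
B = cshift(A, 3)
z_original = A[i] + B[i]

# Optimized block: only B is kept, holding C + D already shifted by +3.
B_opt = cshift([c + d for c, d in zip(C, D)], 3)
z_optimized = shifted_ref(B_opt, -3, i) + B_opt[i]
```

Because B(i) = A(i+3), the reference B<-3>(i) reads the same element that A(i) did, so both versions compute the same z.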

We now establish a lower bound on the number of separate copies required 
for a given clone set. 

Definition 6. A given def-value in a basic block is transient if it is not the 
last def to its variable in the basic block, and the subsequent def to its variable is 
a partial-def. 

For instance, in Example 1 def-value B^ is transient.

Definition 7. A given def-value in a basic block is persistent if it is the last 
def to its variable in the basic block, and its variable is live at the end of the 
basic block. 

For instance, in Example 1 def-value R^5 is persistent.

Theorem 1. Consider the clone tree for a given clone set. There needs to be
a separate copy for each def-value in the clone tree that is either transient or 
persistent. 

Proof: Two transient def-values cannot share the same copy because of Assump- 
tions n and ISl Two persistent def-values cannot share the same copy because 
of Assumption 0 A transient def-value and a persistent def-value cannot share 
the same copy because of Assumptions Eand □ □ 
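Definitions 6 and 7 can be made concrete with a short Python sketch. The tuple encoding of defs, the function name classify, and the labels such as "B2" are assumptions made here for illustration; the input is the def sequence of Example 3 later in this section, with the initial-defs of G and H omitted.

```python
def classify(defs, live_out):
    """Return (transient, persistent) sets of def-value labels.

    defs: list of (variable, statement_number, kind) in program order,
          where kind is "initial", "complete" or "partial".
    live_out: set of variables live at the end of the basic block.
    """
    transient, persistent = set(), set()
    for idx, (var, stmt, _kind) in enumerate(defs):
        label = f"{var}{stmt}"
        later = [d for d in defs[idx + 1:] if d[0] == var]
        if later:
            # Not the last def to its variable: transient only if the next def is partial.
            if later[0][2] == "partial":
                transient.add(label)
        elif var in live_out:
            # Last def to its variable, and the variable is live on exit.
            persistent.add(label)
    return transient, persistent


example3_defs = [
    ("A", 1, "complete"),
    ("B", 2, "complete"),
    ("C", 3, "complete"),
    ("B", 5, "partial"),
    ("A", 6, "complete"),
]
transient, persistent = classify(example3_defs, live_out={"A", "B"})
```

For this input the only transient def-value is B2, and the persistent def-values are B5 and A6. By Theorem 1, the clone tree containing A1, B2 and C3 therefore needs at least one separate copy.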

2.4 Main Techniques for Nonmaterialization Under Constrained 
Statement Rearrangement 

In this section, we outline our main techniques and results on the nonmateriali- 
zation problem for basic blocks under constrained statement rearrangement. 

First, note that for each transient def-value of a given clone set, there is a 
subsequent partial-def to its variable, and this partial-def is the root of another 
clone set. This relationship can be represented in a graph with a node for each 
clone set, as formalized by the following definition. 

Definition 8. The version forest for a basic block is a graph with a node for 
each clone set in the basic block, and a directed edge from clone set α to clone
set β if the root of clone set β is a partial-def that modifies a member of α. The
root def-value of a given tree in the version forest is the root def-value of the 
clone set of the root node of the version tree. 

As an example, the version forest for Example 1 is shown in Figure 2. There
are two trees in the version forest, with root def-values A^0 and R^0, respectively.

Fig. 2. Version forest for Example 1



Definition 9. A node of a version forest is persistent if any of its def-values 
are persistent. 

For instance, in Figure 2 the only persistent node is the node for clone set {R^5}.

Definition 10. The origin point of an initial-def is just before the basic block. 
The origin point of a complete-def or a partial-def is the statement in which 
it receives its value. The origin point of a clone set is the origin point of the 
root def-value of the clone set.

For instance, in Figure 2 the origin point of {A^0, B^, C^, D^} is (0), of {B^4}
is (4), of {R^0} is (0), and of {R^5} is (5).

Definition 11. The demand point of a persistent def-value in a basic block 
is just after the basic block. The demand point of a non-persistent def-value 
is the last ineligible statement that contains a use of the def-value, and is (0) if 
there is no ineligible statement that uses it. The demand point of a clone set 
is the maximum demand point of the def-values in the clone set. 

For instance, in Figure 2 the demand point of {A^0, B^, C^, D^} is (5), of {B^4}
is (5), of {R^0} is (0), and of {R^5} is (6).

Next, we formalize the concept of an essential node of a version tree. Infor- 
mally, an essential node is a node whose value is needed after the values of all its 
child nodes (if any) have been produced. Consequently, the materialization used 
to hold the value of an essential node in order to satisfy this need for the value 
cannot be the same as any of the materializations that are modified to produce 
the values of the child nodes. 

Definition 12. A node of a version tree is an essential node if it is either a 
leaf node, or a non-leaf node whose demand point exceeds the maximum origin 
point of its children. 

For instance, in Figure 2 node {A^0, B^, C^, D^} is essential because its demand
point (5) exceeds the origin point (4) of its child {B^4}. Node {R^0} is not essential
because its demand point (0) does not exceed the origin point (5) of its child
{R^5}. Nodes {B^4} and {R^5} are leaf nodes, and so are essential.
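The test in Definition 12 is easy to state in code. The following Python sketch uses the origin and demand points quoted above for the version forest of Figure 2; the node labels ("A-set", "B-new", "R-init", "R-new") and the function name essential_nodes are hypothetical names chosen for this illustration.

```python
def essential_nodes(children, origin, demand):
    """Return the essential nodes of a version forest (Definition 12).

    A node is essential if it is a leaf, or if its demand point exceeds
    the maximum origin point of its children.
    """
    result = set()
    for node in origin:
        kids = children.get(node, [])
        if not kids or demand[node] > max(origin[k] for k in kids):
            result.add(node)
    return result


# Version forest of Figure 2: two trees, each a parent with one child.
children = {"A-set": ["B-new"], "R-init": ["R-new"]}
origin = {"A-set": 0, "B-new": 4, "R-init": 0, "R-new": 5}
demand = {"A-set": 5, "B-new": 5, "R-init": 0, "R-new": 6}

essential = essential_nodes(children, origin, demand)
```

The sketch marks three nodes as essential and leaves out the root of the R tree, in agreement with the discussion above.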

Proposition 1. Every persistent node of a version tree is essential. 

Proof: A persistent leaf node is essential by definition. A persistent non-leaf 
node is essential because its demand point is just after the basic block, while the 
origin point of each of its children is within the basic block. □ 

Recall that there are three materializations in the optimized code for Example 1.
There is the materialization of A^0, which is also used for C^ and D^. There
is a materialization of transient def-value B^, which is subsequently modified by
the partial-def that creates B^4. Finally, there is a materialization of transient
def-value R^0, which is subsequently modified by the partial-def that creates R^5.
In this example, the number of materializations in the optimized code equals
the number of essential nodes in the version forest. Each of the three essential
nodes of the version forest (shown in Figure 2) for this basic block can be associated
with one of these materializations. Node {A^0, B^, C^, D^} is associated
with materialization A^0. Node {B^4} is associated with materialization B^ (via
the partial-def to B^4). Node {R^5} is associated with materialization R^0 (via the
partial-def to R^5). Nonessential node {R^0} is associated with the same materialization
(R^0) as is associated with its child node, so that the partial-def R^5 that
creates the child node can modify the variable R associated with the parent node.

Now consider the following example, which illustrates changing the destina- 
tion variable of a complete-def, so that a nonessential parent node utilizes the 
same variable as its child node.

Example 3.

(1) A = G + H 

(2) B = cshift(A,shift=+5,dim=1)

(3) C = cshift(A,shift=+3,dim=1)

(4) X = C(i) + 3 

(5) B(i) = 50 

(6) A = G - H 

A and B live, C dead at end of basic block. 

The version forest for this basic block is shown in Figure 3. The basic block
can be optimized by merging A^1, B^2, and C^3 into a single copy stored in B’s official
location, as shown below. (Thus, the references to A, B, and C in statements (1)
through (5) are all replaced by references to B.)

(1') B = (G + H)<+5>

(4) X = B<-2>(i) + 3 

(5) B(i) = 50 

(6) A = G - H 

The complete-def A^1 is transformed into a complete-def to variable B, since
this permits the version tree with root def-value A^1 to be evaluated using only
one materialization. The materialization is in variable B, rather than variable A,
because def-value B^5 is persistent. Def-value A^6 is a persistent def, and so uses
the official location of A.

Note that the version forest contains two essential nodes, {B^5} and {A^6}, and
the optimized code uses two materializations. In the optimized code, essential
node {B^5} utilizes variable B, and is associated with new materialization B^2 (via
the partial-def to B^5). Essential node {A^6} is associated with materialization A^6.
Nonessential node {A^1, B^2, C^3} is associated with the same materialization (B^2)
as is associated with its child node, so that the partial-def B^5 that creates the
child node can modify the variable B utilized by the parent node. The reference
in statement (4) to def-value C^3 from node {A^1, B^2, C^3} is replaced by a reference
to variable B, which holds def-value B^2.




Fig. 3. Version forest for Example 3

^ Since the only defs to G and H in the basic block are initial-defs, there is no need to
include nodes for G^0 and H^0 in the version forest.

The following example illustrates how the freedom to rearrange eligible statements,
as permitted by the Fixed Sequence Assumption, can be
exploited to reduce the number of materializations.

Example 4.

(1) B = cshift(A,shift=+5,dim=1)

(2) A(i) = 120

(3) X = B(j) + 3

(4) C = cshift(A,shift=+2,dim=1)

A and B dead, C live at end of basic block. 

The version forest for this basic block is shown in Figure 4. Note that both
nodes of the version forest are essential. The basic block can be optimized by
moving the complete-def C^4 forward so that it occurs before statement (2), and
letting the partial-def in statement (2) modify C. The optimized code is shown
below, where old statement (4) is relabeled (4') and is now the first statement
in the basic block.

(4') C = cshift(A,shift=+2,dim=1)

(2') C<-2>(i) = 120

(3') X = A<+5>(j) + 3

The optimized code contains two materializations, and is made possible by
changing the reference to B^1 in the demand point statement (3) to be a reference
to its clone A^0. The optimized code utilizes materialization A^0 for node {A^0, B^1},
and utilizes, via partial-def A^2, new materialization C^4 for node {A^2, C^4}.

Fig. 4. Version forest for Example 4



Theorem 2. The number of materializations needed to evaluate a given version 
tree is at least the number of essential nodes. 

Proof Sketch: Let us say that a given def-value utilizes a given materialization 
if the def-value is obtained from the materialization via a sequence of zero or 
more partial-defs. Note that all the def-values utilizing a given materialization 
are def-values for the same variable, and the version forest nodes containing 
these def-values lie along a directed path in the version forest.

Let us say that a given reference to a def-value utilizes the materialization
that is utilized by the referenced def-value. Each non-persistent essential node of
the given version tree has a nonzero demand point. We say that a final use 
of a non-persistent essential node is a use of a def-value from this node in the 
demand point statement for the node. We envision arbitrarily selecting a final 
use of each non-persistent essential node, and we refer to the materialization 
utilized by this final use as the key utilization of the node. We say that the 
final use of a persistent node is one of the persistent def-values in the node, and 
the key utilization of the node is the materialization utilized by this def-value. 

Now consider the version forest corresponding to an optimized evaluation of 
the basic block, where the optimization is constrained by the assumptions given 
above. Each node of the version forest for the optimized basic block corresponds 
to a node of the given version forest. In the optimized code, the key utilizations 
of two distinct essential nodes cannot be the same materialization. □ 

Next, we note that it is not always possible to independently op- 
timize each version tree, because of possible interactions between the 
initial and persistent def-values of the same variable. This interaction is 
captured by the concept of “persistence conflict”, as defined below. 

Definition 13. A persistence conflict in a basic block is a variable that is 
live on both entry to and exit from the basic block. 

Persistence conflicts may prevent each version tree for a given basic block 
from being evaluated with a minimum number of materializations, as illustrated 
by the following example. 

Example 5. 

(1) B = cshift(A,shift=+5,dim=l) 

(2) A = G + H 

(3) B(i) = 50 

(4) X = B(j) + 3 

A live and B dead at end of basic block. 

The version forest for this basic block is shown in Figure 5. The only essential
nodes are {B^3} and {A^2}. Each of the two version trees can by itself be
evaluated using only a single materialization, corresponding to def-values A^0 and
A^2, respectively. However, there is a persistence conflict involving variable A.
The Fixed Sequence Assumption prevents statements (2), (3), and (4) from being
reordered, so in this example, three materializations are necessary.

We next show that for a version tree with no persistent nodes, the lower
bound of Theorem 2 has a matching upper bound. As part of the proof, we
provide an algorithm for producing the optimized code.

Theorem 3. Assuming no persistence conflicts, a version tree with no persis- 
tent nodes can be computed using one materialization for each essential node, 
and no other materializations.

Fig. 5. Version forest for Example 5



Proof Sketch: The following algorithm is given the version tree and basic block 
as input, and produces the optimized code. If there are multiple version trees, 
Steps 1 and 2 of the algorithm can be done for each version tree, and then the 
basic block can be transformed. The algorithm associates a variable name with
each version tree node. We call this variable name the utilization-variable of
the node. In the code produced by the algorithm, all the uses of def-values from
a given node of the version tree reference the utilization-variable of that node.
In the portion of the algorithm that determines the utilization-variable of each 
node, we envision the children of each node being ordered by their origin point, 
so that the child with the last origin point is the rightmost child. 

Step 1. A unique utilization-variable is associated with each essential node, as 
follows. With the following exception, the utilization-variable for each essential 
node is a unique temporary variable. The exception occurs if the root def-value
is an initial-def. In this case, the name of the variable of the initial-def is made 
the utilization-variable for the highest essential node on the path from the root 
of the version tree to the rightmost leaf. 

Step 2. The utilization-variables of nonessential nodes are determined in a 
bottom-up manner, as follows. The utilization-variable of a nonessential node
is made the same as the utilization-variable of its rightmost child (i.e., the child 
with the maximum origin point). 

Step 3. In the evaluation of the basic block, the ineligible statements occur in 
the same order as given. A materialized shuffle statement is included for each 
child node whose utilization-variable is different from the utilization-variable of 
its parent node. This shuffle statement is placed just after the origin point of the 
parent node. All other eligible statements for the version tree are deleted.

Step 4. For each shuffle statement included in Step 3, the utilization-variable of
the parent node is made the source of the shuffle statement, and the utilization- 
variable of the child node is made the destination of the shuffle statement. 

Step 5. If the root def-value is a complete-def, then in the evaluation of the 
basic block, the utilization-variable of the root node is used in the left side of the 
statement for the complete-def. For each nonroot node, the utilization-variable
of the node is used in the left side of the partial-def statement occurring at the
origin point of the node.

Step 6. In the ineligible statements, each use of a def-value from the version tree
is replaced by a reference to the utilization-variable of the version tree node
containing the def-value, with appropriate shift annotation if needed.

Footnote to Step 1: Alternately, one of the variable names occurring in the node’s clone set can be used,
but with each variable name associated with at most one essential node.

Footnote to Step 3: However, a cshift operation involving distributed arrays is replaced by an
overlap_cshift operation placed just after the origin point of the parent node.

Consider the code produced by the above algorithm. In the evaluation of the 
version tree, the number of variable names used equals the number of essential 
nodes, and the number of materializations equals the number of variable names 
used. □ 
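Steps 1 and 2 of the algorithm can be sketched as follows. This Python fragment is an illustration under assumptions, not the authors' code: the function name, the dictionary encoding of the version tree, the node labels, and the temporary names T1, T2, ... are all hypothetical. It covers only the assignment of utilization-variables and does not rewrite the basic block (Steps 3 to 6).

```python
def assign_utilization_variables(root, children, origin, essential, initial_var=None):
    """Map each node of one version tree to a utilization-variable (Steps 1 and 2).

    children: dict node -> list of child nodes.
    origin: dict node -> origin point (used to find the rightmost child).
    essential: set of essential nodes (every leaf is essential).
    initial_var: variable of the root's initial-def, or None for a complete-def root.
    """
    def rightmost(node):
        return max(children[node], key=lambda child: origin[child])

    # Visit parents before their descendants.
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children.get(node, []))

    # Step 1: a unique temporary for each essential node.
    util, counter = {}, 0
    for node in order:
        if node in essential:
            counter += 1
            util[node] = f"T{counter}"

    # Exception in Step 1: reuse the initial-def's variable for the highest
    # essential node on the path from the root to the rightmost leaf.
    if initial_var is not None:
        node = root
        while node not in essential:
            node = rightmost(node)
        util[node] = initial_var

    # Step 2: bottom-up, a nonessential node inherits from its rightmost child.
    for node in reversed(order):
        if node not in essential:
            util[node] = util[rightmost(node)]
    return util


# The two version trees of Figure 2 (labels chosen for this sketch).
tree_a = assign_utilization_variables(
    "A-set", {"A-set": ["B-new"]}, {"A-set": 0, "B-new": 4},
    essential={"A-set", "B-new"}, initial_var="A")
tree_r = assign_utilization_variables(
    "R-init", {"R-init": ["R-new"]}, {"R-init": 0, "R-new": 5},
    essential={"R-new"}, initial_var="R")
```

In the R tree both nodes end up sharing variable R, so the partial-def can update it in place. In the A tree the root keeps variable A and the child receives a fresh temporary; the footnote to Step 1 permits using the name B for it instead.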

In the optimized code for Example 1, node {A^0, B^, C^, D^} utilizes materialization
A^0 and variable A, node {B^4} utilizes materialization B^ and variable B,
and nodes {R^0} and {R^5} utilize materialization R^0 and variable R. In Example 2
the version tree has only one node ({A^1, B^2}); the optimized code for this node
utilizes materialization B^2 and variable B. In the optimized code for Example 3,
nodes {A^1, B^2, C^3} and {B^5} utilize materialization B^2 and variable B, and node
{A^6} utilizes materialization A^6 and variable A. In the optimized code for Example 4,
node {A^0, B^1} utilizes materialization A^0 and variable A, and node {A^2, C^4}
utilizes materialization C^4 and variable C.

Theorem 4. Assuming no persistence conflicts, the number of materializations 
needed to compute a version tree with no persistent nodes is the number of es- 
sential nodes of the version tree. 

Proof: Immediate from Theorems 2 and 3. □

When there are persistence conflicts, the number of materializations needed
to compute a version tree is at most the number of essential nodes of the version
tree, plus the number of extra persistent def-values in persistent nodes, plus the
number of persistence conflicts. This follows, since each persistence conflict can
be eliminated by introducing an extra materialization for the variable involved.
However, when a persistence conflict involves a variable for which the initial
def-value and persistent def-value both occur in the same version tree, we can
determine efficiently if this extra materialization is indeed necessary.

3 Conclusions 

Minimizing nonmaterializations is a deep problem that can be approached from 
an intensional perspective, utilizing an analysis of the role of entire arrays. In a 
single expression, minimizing materializations is mainly an issue of proper array 
accessing. However, in a basic block, the decisions as to which materializations to 
eliminate interact, so minimizing materializations is a combinatorial optimiza- 
tion problem. The concepts of clone sets, version forests, and essential nodes seem 
to model fundamental aspects of the problem. Under the assumptions listed in 
Section 2.2, each essential node of a version forest requires a distinct materialization,
thereby establishing a lower bound on the number of materializations
required. Theorem 3 provides an algorithm when persistent variables are not an
issue. The algorithm produces code with one materialization per essential node
of the version tree, and no additional materializations. The algorithm is optimal,
in the sense of producing the minimum number of materializations.

Abstract. Optimizing compilers have traditionally focused on enhancing
the performance of a given piece of code. With the proliferation of embedded
software, it is becoming important to identify the energy impact
of these traditional performance-oriented optimizations and to develop 
new energy-aware schemes. Towards this goal, this paper explores the 
energy consumption behavior of one of the widely-used loop-level com- 
piler optimizations, iteration space tiling, by varying a set of software 
and hardware parameters. 

Our results show that the choice of tile size and input size critically 
impacts the system energy consumption. Specifically, we find that the 
best tile size for the least energy consumed is different from that for 
the best performance. Also, tailoring tile size to the input size generates 
better energy results than working with a fixed tile size. Our results also 
reveal that tiling should be applied more or less aggressively based on 
whether the low power objective is to prolong the battery life or to limit 
the energy dissipated within a package. 



1 Introduction 

With the continued growth in the use of mobile personal devices, the design of 
new energy-optimization techniques has become vital. While many techniques 
have been proposed at the circuit and architectural level, there is clearly a need 
for making the software that executes on this hardware energy-efficient as well.
This is of importance as the application code that runs on an embedded device 
is the primary factor that determines the dynamic switching activity, one of the 
contributors to dynamic power dissipation. 

Power is the rate at which energy is delivered or exchanged and is measured 
in Watts. Power consumption impacts battery energy-density limitations, cir- 
cuit reliability, and packaging costs. Packaging cost constraints typically require 
the use of cheaper packages that impose strict limits on power dissipation. For 
example, plastic packages typically limit the power dissipation to 2W. Thus, 
limiting the power dissipated within a single package is important to meet cost 
constraints. Energy, measured in Joules, is the power consumed over time and is
an important metric to optimize for prolonging battery life. The proliferation of
embedded and mobile devices has made low power designs vital for prolonging
the limited battery power. The energy of commercially available rechargeable
batteries has improved only 2% per year over the past fifty years. Since the 
entire system is operated by the same battery in such devices, the goal of low 
power optimization must encompass all components in the system. In our work, 
we investigate the trade-offs when both the entire system energy and the en- 
ergy consumed within a single package need to be simultaneously optimized. In 
particular, we study this trade-off when software optimizations are applied. 

Optimizing compilers have traditionally focused on enhancing the performance
of a given piece of code (e.g., see m and the references therein). With
the proliferation of embedded software, it is becoming important to identify the 
energy impact of these traditional performance-oriented optimizations and to 
develop new energy-aware schemes. We believe that the first step in developing 
energy-aware optimization techniques is to understand the influence of widely 
used program transformations on energy consumption. Such an understanding 
would serve two main purposes. First, it will allow compiler designers to see 
whether current performance-oriented optimization techniques are sufficient for
minimizing the energy-consumption, and if not, what additional optimizations 
are needed. Second, it will give hardware designers an idea about the influence 
of widely used compiler optimizations on energy-consumption, thereby enabling 
them to evaluate and compare different energy-efficient design alternatives with 
these optimizations. 

While it is possible and certainly beneficial to evaluate each and every compiler
optimization from an energy perspective, in this paper, we focus our attention
on iteration space (loop) tiling, a popular high-level (loop-oriented) transforma- 
tion technique used mainly for optimizing data locality [2ai2Sll23II3,iniE3|. 
This optimization is important because it is very effective in improving data lo- 
cality and it is used by many optimizing compilers from industry and academia. 
While the behavior of tiling from a performance perspective has been understood to
a large extent and important parameters that affect its performance have been
thoroughly studied and reported, its influence on system energy is yet to be
fully understood. In particular, its influence on the energy consumption of different
system components (e.g., datapath, caches, main memory system, etc.) is yet to 
be explored in detail. 

Having identified loop tiling as an important optimization, in this paper,
we evaluate it, with the help of our cycle-accurate simulator, SimplePower [22],
from the energy point of view considering a number of factors. The scope of our 
evaluation includes different tiling styles (strategies), modifying important pa- 
rameters such as input size and tile size (blocking factor) and hardware features 
such as cache configuration. In addition, we also investigate how tiling performs 
in conjunction with two recently-proposed energy-conscious cache architectures, 
how current trends in memory technology will affect its effectiveness on different 
system components, and how it interacts with other loop-oriented optimizations
as far as energy consumption is concerned. Specifically, in this paper, we make
the following contributions: 

— We report the energy consumption caused by different styles of tiling using 
a matrix-multiply code as a running example.

— We investigate energy-sensitivity of tiling to tile size and input size. 

— We investigate its energy performance on several cache configurations includ- 
ing a number of new cache architectures and different technology parameters. 

— We evaluate the energy consumption and discuss the results when tiling is 
accompanied by other code optimizations. 

Our results show that while tiling reduces the energy spent in main memory 
system, it may increase the energy consumed in the datapath and on-chip caches. 
We also observed a great variation in the energy performance of tiling when the tile
size is modified; this shows that determining optimal tile sizes is an important 
problem for compiler writers for power-aware embedded systems. Also, tailoring 
tile size to the input size generates better energy results than working with a 
fixed tile size for all inputs. 

The remainder of this paper is organized as follows. In Section 2, we introduce
our experimental platform and experimental methodology. In Section 3, we
report energy results for different styles of tiling using a matrix-multiply code
as a running example and evaluate the energy sensitivity of tiling with respect
to software and hardware parameters and technological trends. In Section 4, we
discuss related work and conclude the paper with a summary in Section 5.

2 Our Platform and Methodology 

The experiments in this paper were carried out using the SimplePower energy
estimation framework [22]. This framework includes a transition-sensitive, cycle-accurate
datapath energy model that interfaces with analytical and transition-sensitive
energy models for the memory and bus sub-systems, respectively. The
datapath is based on the ISA of the integer subset of the SimpleScalar architec- 
ture 0 and the modeling approach used in this tool has been validated to be 
accurate (average error rate of 8.98%) using actual current measurements of a 
commercial DSP architecture |0. The memory system of SimplePower can be 
configured for different cache sizes, block sizes, associativities, write and replace- 
ment policies, number of cache sub-banks, and cache block buffers. SimplePower 
uses the on-chip cache energy model proposed in |Sj using 0.8μ technology parameters
and the off-chip main memory energy per access cost based on the
Cypress SRAM CY7C1326-133 chip. In our design, the datapath and instruc- 
tion and data caches are assumed to be in a single package and the main memory 
in a different package. We input our C codes into this framework to obtain the 
energy results. 

All the tiled codes in this paper are obtained using an extended version of 
the source-to-source optimization framework discussed in mg. Each tiled version 
is named by using the indices of the loops that have been tiled. For example, ij
denotes a version where only the i and j loops have been tiled. Unless otherwise
stated, in both the tile loops (i.e., the loops that iterate over tiles) and the element
loops (i.e., the loops that iterate over elements in a given tile), the original order 
of loops is preserved except that the untiled loop(s) is (are) placed right after 
the tile loops and all the element loops are placed into the innermost positions. 
Note that the matrix-multiply code is fully permutable m and all styles of 
tilings are legal from the data dependences perspective. We also believe that the 
matrix-multiply code is an interesting case study because (as noted by Lam et al. 
M) locality is carried in three different loops by three different array variables. 
However, similar data reuse patterns and energy behaviors can be observed in 
many codes from the signal and video processing domains. 
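The loop structure described above can be illustrated with a small sketch of the ij version. The experiments in the paper use C codes produced by a source-to-source framework; the Python below is only a schematic stand-in, and it assumes an original i, j, k loop order, square N-by-N matrices, and hypothetical function names.

```python
def matmul_untiled(a, b, n):
    """Reference i, j, k matrix multiply on n-by-n lists of lists."""
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled_ij(a, b, n, tile):
    """The 'ij' version: only the i and j loops are tiled.

    Tile loops (ii, jj) come first, the untiled k loop is placed right after
    them, and the element loops (i, j) occupy the innermost positions.
    """
    c = [[0] * n for _ in range(n)]
    for ii in range(0, n, tile):                        # tile loop for i
        for jj in range(0, n, tile):                    # tile loop for j
            for k in range(n):                          # untiled loop
                for i in range(ii, min(ii + tile, n)):      # element loop for i
                    for j in range(jj, min(jj + tile, n)):  # element loop for j
                        c[i][j] += a[i][k] * b[k][j]
    return c


n = 5
a = [[i + j for j in range(n)] for i in range(n)]
b = [[i * j + 1 for j in range(n)] for i in range(n)]
c_ref = matmul_untiled(a, b, n)
c_tiled = matmul_tiled_ij(a, b, n, tile=2)
```

The min calls in the element-loop bounds handle tiles that run past the matrix edge when the tile size does not divide N; bound computations of this kind are part of the extra loop overhead that tiling introduces.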

In all the experiments, our default data cache is 4 KB, one-way associative 
(direct-mapped) with a line size of 32 bytes. The instruction cache that we 
simulated has the same configuration. All the reported energy values in this 
paper are in Joules (J). 



3 Experimental Evaluation 

3.1 Tiling Strategy 

In our first set of experiments, we measure the energy consumed by the matrix-multiply
code tiled using different strategies. Figure 1 shows the total energy
consumption of eight different versions of the matrix-multiply code (one original 
and seven tiled) for an input size of N=50. We observe that tiling reduces the 
overall energy consumption of this code. 

In order to further understand the energy behavior of these codes, we break 
down the energy consumption into different system components: datapath, data 
cache, instruction cache, and main memory. As depicted in Figure 1, tiling a
larger number of loops in general increases the datapath energy consumption. 
The reason for this is that loop tiling converts the input code into a more com- 
plex code which involves complicated loop bounds, a larger number of nests, 
and macro/function calls (for computing loop upper bounds). All these cause 
more branch instructions in the resulting code and more comparison operations 
that, in turn, increase the switching activity (and energy consumption) in the 
datapath. For example, tiling only the i loop increases the datapath energy con- 
sumption by approximately 10%. 

A similar negative impact of tiling (to a lesser extent) is also observed in 
the instruction cache energy consumption (not shown in figure). When we tile, 
we observe a higher energy consumption in the instruction cache. This is due to 
the increased number of instructions accessed from the cache due to the more 
complex loop structure and access pattern. However, in this case, the energy 
consumption is not that sensitive to the tiling strategy, mainly because the small 
size of the matrix-multiply code does not put much pressure on the instruction 
cache. We also observe that the number of data references increases as a result
of tiling. This causes an increase in the data cache energy. We speculate that
this behavior is due to the influence of back-end compiler optimizations when
operating on the tiled code. 

When we consider the main memory energy, however, the picture totally 
changes. For instance, when we tile all three loops, the main memory energy 
becomes 35.6% of the energy consumed by the untiled code. This is due to
the reduced number of accesses to the main memory as a result of better data locality.
In the overall energy consumption, the main memory energy dominates and the
tiled versions result in significant energy savings. To sum up, we can conclude that
(for this matrix-multiply code) loop tiling increases the energy consumption in
datapath, instruction cache, and data cache, but significantly reduces the energy
in main memory. Therefore, if one only intends to prolong the battery life, one
can apply tiling aggressively. If, on the other hand, the objective is to limit
the energy dissipated within each package (e.g., the datapath+caches package),
one should be more careful as tiling tends to increase both datapath and cache 
energies. It should also be mentioned that, in all the versions experimented with 
here, tiling improved the performance (by reducing execution cycles). Hence, it 
is easy to see that since there is an increase in the energy spent (and decrease 
in execution time) within the package that contains datapath and caches, tiling 
causes an increase in average power dissipation for that package. 
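
The loop structure behind these observations can be illustrated with a minimal Python sketch. It is not the code used in the experiments; the function names `matmul_untiled` and `matmul_tiled` and the tile parameter `t` are assumptions made for illustration. Tiling all three loops turns a three-deep nest into a six-deep one, which is where the extra branch and comparison instructions come from.

```python
def matmul_untiled(a, b, n):
    """Reference i-j-k matrix multiply on n x n lists of lists."""
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, n, t):
    """Same computation with the i, j, and k loops all tiled by t."""
    c = [[0] * n for _ in range(n)]
    # The three outer "tile" loops step from block to block.
    for ii in range(0, n, t):
        for jj in range(0, n, t):
            for kk in range(0, n, t):
                # The three inner "element" loops stay inside one block;
                # min() handles the case where t does not divide n.
                for i in range(ii, min(ii + t, n)):
                    for j in range(jj, min(jj + t, n)):
                        for k in range(kk, min(kk + t, n)):
                            c[i][j] += a[i][k] * b[k][j]
    return c
```

Both functions compute the same product; only the order in which the iteration space is traversed, and the amount of loop control code executed, differ.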




Fig. 1. Energy consumptions of different tiling strategies. 

3.2 Sensitivity to the Tile Size and Input Size 

While the tile size sensitivity issue has largely been addressed in performance-oriented 
studies, the studies that look at the problem from the 
energy perspective are few. Thus, we explored the influence of tile size on 
energy consumption. We summarize the key observations from our experiments: 

— Increasing the tile size reduces the datapath energy and instruction cache 
energy. This is because a large tile size (blocking factor) means smaller loop 
(code) overhead. However, as in the previous set of experiments, the overall 
energy behavior is largely determined by the energy spent in main memory. 

— There is little change in data cache energy as we vary the tile size. This is 
because the number of accesses to the data cache is almost the same for all 
tile sizes (in a given tiling strategy). 

— The energy consumption of a given tiling strategy depends to a large extent 
on the tile size. For instance, when N is 50 and both j and k loops are 
tiled, the overall energy consumption can be as small as 0.064J or as large 
as 0.128J depending on the tile size chosen. It was observed that, for each 
version of tiled code, there is a most suitable tile size beyond which the 
energy consumption starts to increase. 

— For a given version, the best tile size from the energy point of view was different 
from the best tile size from the performance (execution cycles) point of view. 
For example, as far as the execution cycles are concerned, the best tile size 
for the ik version was 10 (instead of 5). 

Next, we varied the input size (N) and observed that for each input size 
there exists a best tile size from the energy point of view. To be specific, the best 
tile sizes (among the ones we experimented with for the ijk version) for input 
sizes of 50, 100, 200, 300, and 400 are 10, 20, 20, 15, and 25, respectively. Thus, 
it is important for researchers to develop optimal tile size detection algorithms 
(for energy) similar to the algorithms proposed for detecting the best tile sizes for 
performance. 
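
In the absence of such an algorithm, the selection can be done by exhaustive measurement. The following Python sketch assumes that energy and cycle counts have already been obtained for each candidate tile size; the function name `best_tile_sizes` and the numbers in `example` are hypothetical and are not taken from the experiments.

```python
def best_tile_sizes(measurements):
    """Return (best tile for energy, best tile for cycles).

    measurements maps a tile size to a pair (energy, cycles),
    in whatever units the measurements were taken.
    """
    best_energy = min(measurements, key=lambda t: measurements[t][0])
    best_cycles = min(measurements, key=lambda t: measurements[t][1])
    return best_energy, best_cycles


# Hypothetical measurements in arbitrary units, for illustration only.
example = {5: (0.9, 9.0), 10: (0.7, 8.1), 20: (0.8, 7.6), 25: (1.1, 8.8)}
energy_choice, cycle_choice = best_tile_sizes(example)
```

With these made-up numbers the two criteria select different tile sizes, which is the situation described in the last observation above.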



3.3 Sensitivity to the Cache Configuration 

In this subsection, we evaluate the data cache energy consumption when the un- 
derlying cache configuration is modified. We experiment with different cache sizes 
and associativities as well as two energy-efficient cache architectures, namely, 
block buffering and sub-banking. 

In the block buffering scheme, the previously accessed cache line is buffered 
for subsequent accesses. If the data within the same cache line is accessed on 
the next data request, only the buffer needs to be accessed. This avoids the 
unnecessary and more energy consuming access to the entire cache data array. 
Thus, increasing temporal locality of the cache line through compiler techniques 
such as loop tiling can save more energy. In the cache sub-banking optimization, 
the data array of the cache is divided into several sub-banks and only the sub-bank 
where the desired data is located is accessed. This optimization reduces 
the per access energy consumption and is not influenced by locality optimization 
techniques. We also evaluate cache configurations that combine both these optimizations. 
In such a configuration with block buffering and sub-banking, each 
sub-bank has an individual buffer. Here, the scope for exploiting locality is lim- 
ited as compared to applying only block buffering as the number of words stored 
in a buffer is reduced. However, it provides the additional benefits of sub-banking 
for each cache access. 
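
The effect of buffering on an address trace can be illustrated with a small Python sketch. It is a simplified model under stated assumptions, not the simulator used in the experiments: the function name `count_block_buffer_hits` is hypothetical, each buffer holds one line (or one sub-bank's share of a line), and consecutive chunks of a line are assumed to map to successive sub-banks.

```python
def count_block_buffer_hits(addresses, line_size, num_subbanks=1):
    """Count accesses served from a block buffer instead of the data array.

    addresses    : iterable of byte addresses, in access order
    line_size    : cache line size in bytes
    num_subbanks : 1 models plain block buffering; larger values model one
                   buffer per sub-bank, each covering line_size / num_subbanks
                   bytes (assumed interleaving, see the note above)
    """
    chunk = line_size // num_subbanks   # bytes covered by a single buffer
    buffers = {}                        # sub-bank index -> buffered chunk id
    hits = 0
    for addr in addresses:
        chunk_id = addr // chunk
        bank = chunk_id % num_subbanks
        if buffers.get(bank) == chunk_id:
            hits += 1                   # served from the buffer
        else:
            buffers[bank] = chunk_id    # access the array and refill the buffer
    return hits
```

For a sequential walk over 64 bytes in 4-byte steps with 32-byte lines, `count_block_buffer_hits(range(0, 64, 4), 32)` counts 14 buffer hits out of 16 accesses, while `num_subbanks=4` leaves only 8, reflecting the reduced number of words per buffer.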

We first focus on the traditional cache model and present in Figure 2 the energy 
consumed only in the data cache for different cache sizes and associativities. We 
experiment with two different codes (with N=200), the untiled version and a 
blocked version where all three loops (i, j, and k) are tiled with a tile size of 
twenty. Our first observation is that the data cache energy is not too sensitive 
to the associativity but, on the other hand, is very sensitive to the cache size. 
This is because for a given code, the number of read accesses to the data cache 
is constant, and the cache energy per data access is higher for a larger cache. 
Increasing associativity also increases per access cost for cache (due to increased 
bit line and word line capacitances), but its effect is found to be less significant 
as compared to the increase in bit line capacitance due to increased cache sizes. 
As a result, embedded system designers need to determine the minimum data cache 
size for the set of applications in question if they want to minimize data cache 
energy. Another observation is that, for all cache sizes and associativities, going 
from the untiled code to the tiled code increases the data cache energy. 

We next concentrate on cache line size and vary it between 8 bytes and 
64 bytes for N=200 and T=50. The energy consumptions of the ijk version 
for line sizes of 8, 16, 32, and 64 bytes were 0.226J, 0.226J, 0.226J, and 0.227J, 
respectively, indicating that (for this code) the energy consumption in the data cache 
is relatively independent of the line size. It should also be mentioned that 
while increases in cache size and degree of associativity might lead to increases 
in data cache energy, they generally reduce the overall memory system energy 
by reducing the number of accesses to the main memory. 

Finally, we focus on block buffering and sub-banking, and in Figure 3 give 
the data cache energy consumption for different combinations of block buffering 
(denoted bb) and sub-banking (denoted sb) for both the untiled and tiled (the 
ijk version) codes. The results reveal that for the best energy reduction block 
buffering and sub-banking should be used together. When used alone, neither 
sub-banking nor block buffering is very effective. The results also show that 
increasing the number of block buffers does not bring any benefit (as there is 
only one reference with temporal locality in the innermost loop). It should be 
noted that the energy increase caused by tiling on data cache can (to some 
extent) be compensated using a configuration such as bb+sb. 

Fig. 2. Impact of cache size and associativity on data cache energy. 

[Fig. 3 plot: bars for bb, 2bb, sb, bb+sb, and 2bb+sb under 1w and 2w; N=200, T=20 (ijk).] 

Fig. 3. Impact of block buffering (bb) and sub-banking (sb) on data cache energy. 



3.4 Cache Miss Rates Vs. Energy 

We now investigate the relative variations in miss rate and energy performance 
due to tiling. The following three measures are used to capture the correlation 
between the miss rates and energy consumption of the unoptimized (original) 
and optimized (tiled) codes. 

\[
\mathrm{Improvement}_m = \frac{\text{Miss rate of the original code}}{\text{Miss rate of the optimized code}},
\]

\[
\mathrm{Improvement}_e = \frac{\text{Memory energy consumption of the original code}}{\text{Memory energy consumption of the optimized code}},
\]

\[
\mathrm{Improvement}_t = \frac{\text{Total energy consumption of the original code}}{\text{Total energy consumption of the optimized code}}.
\]

In the following discussion, we consider four different cache configurations: 
1K, 1-way; 2K, 4-way; 4K, 2-way; and 8K, 8-way. Given a cache configuration, 
Table 1 shows how these three measures vary when we move from the original 
version to an optimized version. 

Table 1. Improvements in miss rate and energy consumption. 

                 1K, 1-way   2K, 4-way   4K, 2-way   8K, 8-way
Improvement_m      6.21        63.31       20.63       19.50
Improvement_e      2.13        18.77        5.75        2.88
Improvement_t      1.96         9.27        3.08        1.47


We see that in spite of very large reductions in miss rates as a result of tiling, 
the reduction in energy consumption is not as high. Nevertheless, it still follows 
the miss rate. We also made the same observation in other codes we used. We 
have found that Improvement_e is smaller than Improvement_m by a factor of 2 
to 15. Including the datapath energy makes the situation worse for tiling (from 
the energy point of view), as this optimization in general increases the datapath 
energy consumption. Therefore, compiler writers for energy-aware systems can 
expect an overall energy reduction as a result of tiling, but not as much as the 
reduction in the miss rate. We believe that some optimizing compilers 
that estimate the number of data accesses and cache misses statically at compile 
time can also be used to estimate an approximate value for the energy variation. 
This variation is mainly dependent on the energy cost formulation parameterized 
by the number of hits, number of misses, and cache parameters. 
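
A cost formulation of this kind can be sketched in a few lines of Python. The model below is an assumption made for illustration (every access pays a per-access cache cost, and every miss additionally pays a main memory cost); the function names, the access counts, and the per-access costs are hypothetical and are not the parameters used in the experiments.

```python
def miss_rate(hits, misses):
    """Fraction of accesses that miss in the cache."""
    return misses / (hits + misses)


def memory_energy(hits, misses, e_cache_access, e_main_access):
    """Cache energy for every access plus main memory energy for every miss."""
    return (hits + misses) * e_cache_access + misses * e_main_access


# Hypothetical counts: tiling adds data references but removes most misses.
original = {"hits": 800, "misses": 200}
tiled = {"hits": 1176, "misses": 24}

# Hypothetical per-access costs in arbitrary units.
costs = {"e_cache_access": 1.0, "e_main_access": 10.0}

improvement_m = miss_rate(**original) / miss_rate(**tiled)
improvement_e = memory_energy(**original, **costs) / memory_energy(**tiled, **costs)
```

With these made-up inputs the miss rate improves by a factor of about 10 while the memory energy improves by a factor of about 2.1, which mirrors the qualitative relationship seen in Table 1.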



3.5 Interaction with Other Optimizations 

In order to see how loop tiling is affected by other loop optimizations, we 
perform another set of experiments where we measure the energy consumption 
of tiling with linear loop optimization (loop interchange, to be specific) and 
loop unrolling. Loop interchange modifies the original order of loops to obtain 
better cache performance. In our matrix-multiply code, this optimization 
converts the original loop order i,j,k (from outermost to innermost) to i,k,j, 
thereby obtaining spatial locality for arrays b and c, and temporal locality for 
array a, all in the innermost loop. We see from Figure 4 that tiling (in general) 
reduces the overall energy consumption of even this optimized version of 
the matrix-multiply nest. Note, however, that it increases the datapath energy 
consumption. Comparing these graphs with those in Figure 1, we observe that 
the interchanged tiled version performs better than the pure tiled version, which 
suggests that tiling should be applied in general after linear loop transformations 
for the best energy results. 
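
The interchange described above can be written out as a minimal Python sketch. The function names `matmul_ijk` and `matmul_ikj` are hypothetical, and the locality comments assume a row-major layout in which c[i][j] += a[i][k] * b[k][j] is the loop body; the original code is not reproduced here.

```python
def matmul_ijk(a, b, n):
    """Original loop order: i, j, k from outermost to innermost."""
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                # b is walked down a column in the innermost loop.
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_ikj(a, b, n):
    """Interchanged loop order: i, k, j from outermost to innermost."""
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            a_ik = a[i][k]              # reused across the whole j loop
            for j in range(n):
                # b and c are both walked along a row in the innermost loop.
                c[i][j] += a_ik * b[k][j]
    return c
```

The two versions produce the same result; only the order of the memory references changes.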

The interaction of tiling with loop unrolling is more complex. Loop unrolling 
reduces the iteration count by doing more work in a single loop iteration. We 
see from Figure 4 that untiled loop unrolling may not be a good idea as its 
energy consumption is very high. Applying tiling brings the energy consumption 
down. Therefore, in circumstances where loop unrolling must be applied (e.g., 
to promote register reuse and/or to improve instruction level parallelism), we 
suggest applying tiling as well to keep the energy consumption under control. 

[Fig. 4 panels: linear transformation; loop unrolling.] 

Fig. 4. Interaction of loop tiling with loop interchange and unrolling. 



3.6 Analysis of Datapath Energy 

We next zoom in on the datapath energy, and investigate the impact of tiling 
on different components of the datapath as well as on different stages of the 
pipeline. Table 2 shows the breakdown of the datapath energy for the matrix-multiply 
code into different hardware components. For comparison purposes, we 
also give the breakdown for two other optimizations, loop unrolling (denoted 
u) and linear loop transformation (denoted l), as well as different combinations 
of these two optimizations and tiling (denoted t), using an input parameter of 
N=100. Each entry in this table gives the percentage of the datapath energy 
expended in the specific component. We see that (across different versions) the 
percentages remain relatively stable. However, we note that all the optimizations 
increase the (percentage of) energy consumption in functional units due to more 
complex loop nest structures that require more computation in the ALU. The 
most significant increase occurs with tiling (from 15.76% to 19.93%), and it is 
more than 26% in relative terms. These 
results also tell us that most of the datapath energy is consumed in register 
files and pipeline registers, and therefore hardware designers should focus 
more on these units. Table 3, on the other hand, gives the energy consumption 
breakdown across five pipeline stages (the fetch stage IF, the instruction decode 
stage ID, the execution stage EXE, the memory access stage MEM, and the write-back 
stage WB). The entries under the MEM and IF stages here do not involve 
the energy consumed in data and instruction cache memory, respectively. We 
observe that most of the energy is spent in the ID, EXE, and WB stages. Also, the 
compiler optimizations in general increase the energy consumption in the EXE 
stage since that is where the ALU sits; this increase is between 1% and 8% 
and also depends on the program being run. 

Table 2. Datapath energy breakdown (in %s) in hardware components level. 

Version       Register File   Pipeline Registers   Functional Units   Data-path Muxes
unoptimized       35.99             36.33               15.76              8.36
l                 36.09             34.87               17.34              8.11
u                 36.19             36.17               15.98              8.31
t                 34.60             33.56               19.93              7.80
l+u               35.87             34.12               18.19              7.93
l+t               35.27             33.74               19.25              8.17
t+u               35.31             35.07               17.89              8.06
t+l+u             35.41             34.15               18.38              7.96

Table 3. Datapath energy breakdown (in %s) in pipeline stage level. 

Version        IF      ID      EXE     MEM     WB
unoptimized    3.33   22.94   33.17    8.70   31.87
l              3.10   23.88   34.20    8.32   30.50
u              3.18   23.93   33.47    8.63   30.78
t              3.25   24.04   35.91    7.95   28.85
l+u            2.97   24.83   34.81    8.13   29.27
l+t            2.95   23.23   35.73    8.07   30.02
t+u            3.15   23.61   34.78    8.34   30.12
t+l+u          2.95   24.63   35.02    8.14   29.26

3.7 Sensitivity to Technology Changes 

The main memory has been a major performance bottleneck and has attracted 
a lot of attention. Changes in process technology have made it possible to 
embed a DRAM within the same chip as the processor core. Initial results using 
embedded DRAM (eDRAM) show an order of magnitude reduction in the energy 
expended in main memory. Also, there have been significant changes in the 
DRAM interfaces that can potentially reduce the energy consumption. For 
example, unlike conventional DRAM memory sub-systems that have multiple 
memory modules that are active for servicing data requests, the direct RDRAM 
memory sub-system delivers full bandwidth with only one RDRAM module 
active. Also, based on the particular low power modes that are supported by 
the memory chips and based on how effectively they are utilized, the average 
per access energy cost for main memory can be reduced by up to two orders of 
magnitude. 

In order to study the influence of changes in Em due to these technology trends, 
we experiment with different Em values that range from 4.95 x 10“® (our default 
value) to 2.475 x 10“^^. We observe from Figure 5 that from Em = 4.95 x 10“® on, 
the main memory energy starts to lose its dominance and instruction cache and 
datapath energies constitute the largest percentage. While this is true for both 
tiled and untiled codes, the situation in tiled codes is more dramatic as can be 
seen from the figure. For instance, when Em = 2.475 x 10“^°, N=100, and T=10, 
the datapath energy is nearly 5.7 times larger than the main memory energy 
(which includes the energy spent in both data and instruction accesses), and the 
Icache energy is 13.7 times larger than the main memory energy. With the untiled 
code, however, these values are 0.98 and 2.43, respectively. The challenge for 
future compiler writers for power aware systems then is to use tiling judiciously 
so that the energy expended in datapath and on-chip caches can be kept under 
control. 

Fig. 5. Energy consumption with different Em (J) values. 



3.8 Other Codes 

In order to increase our confidence in our observations on the matrix-multiply 
code, we also performed tiling experiments using several other loop nests that 
manipulate multi-dimensional arrays. The salient characteristics of these nests 
are summarized in Table 4. The first four loop nests in this table are from 
the Spec92/Nasa7 benchmark suite; syr2k.1 is from BLAS; and htribk.2 and 
qzhes.4 are from the Eispack library. For each nest, we used several tile sizes, 
input sizes, and per memory access costs, and found that the energy behavior 
of these nests is similar to that of the matrix-multiply code. However, due to lack of 
space, we report here only the energy breakdown of the untiled and two tiled 
codes (in a normalized form) using two representative Em values (4.95 x 10“® 
and 2.475 x 10“^^). In Figure 6, for each code, the three bars correspond to 
the untiled, tiled (ijk,T=5), and tiled (ijk,T=10) versions, respectively, from 
left to right. Note that while the main memory energy dominates when Em is 
4.95 x 10“®, the instruction cache and datapath energies dominate when Em is 
2.475 x 10“^^. 



Table 4. Benchmark nests used in the experiments. The number following the 
name corresponds to the number of the nest in the respective code. 

nest         arrays                  data size   tile sizes
btrix.4      two 4-D                 21.0 MB     10 and 20
vpenta.3     one 3-D and five 2-D    16.6 MB     20 and 40
cholesky.2   one 3-D                 10 MB       10 and 20
emit.4       one 2-D and one 1-D     2.6 MB      50 and 100
htribk.2     three 2-D               72 KB       12 and 24
syr2k.1      three 2-D               84 KB       8 and 16
qzhes.4      one 2-D                 160 KB      15 and 30

Fig. 6. Normalized energy consumption of example nested loops with two different 
Em (J) values. 



4 Related Work 

Compiler researchers have attacked the locality problem from different perspectives. 
The works presented in Wolf and Lam, Li, Coleman and McKinley, 
Kodukula et al., Lam et al., and Xue and Huang, among others, 
have suggested tiling as a means of improving cache locality. Prior work has also 
emphasized the importance of linear locality optimizations before tiling. While 
all of these studies have focussed on the performance aspect of tiling, in this 
paper, we investigate its energy behavior and show that the energy behavior of 
tiling may depend on a number of factors including tile size, input size, cache 
configuration, and per memory access cost. 

Shiue and Chakrabarti presented a memory exploration strategy based 
on three metrics, namely, processor cycles, cache size, and energy consumption. 
They have found that increasing tile size and associativity reduces the number of 
cycles but does not necessarily reduce the energy consumption. In comparison, 
we focus on the entire system (including datapath and instruction cache) and 
study the impact of a set of parameters on the energy behavior of tiling using 
several tiling strategies. We also show that datapath energy consumption due to 
tiling might be more problematic in the future, considering the current trends 
in memory technology. The IMEC group [2] was among the first to work on 
applying loop transformations to minimize power dissipation in data dominated 
embedded applications. 

In this paper, we utilize a framework that was proposed in earlier work. This framework 
has been used to investigate the energy influence of a set of high-level 
compiler optimizations that include tiling, linear loop transformations, loop unrolling, 
loop fusion, and distribution. However, that work accounts 
for only the energy consumed in data accesses and does not investigate tiling in 
detail. In contrast, our current work looks at tiling in more detail, investigating 
different tiling strategies, the influence of varying tile sizes, and the impact of input 
sizes. Also, in this paper, we account for the energy consumed by the entire 
system including the instruction accesses. While earlier work studies different 
tiling strategies, this paper focuses on the impact of different tiling parameters, 
the performance versus energy consumption impact, and the interaction of tiling 
with other high-level optimizations. 

5 Conclusions 

When loop nest based computations process large amounts of data that do not 
fit in cache, tiling is an effective optimization for improving performance. While 
previous work on tiling has focussed exclusively on its impact on performance 
(execution cycles), it is critical to consider its impact on energy as embedded and 
mobile devices are becoming the tools for mainstream computation and are starting to 
take advantage of high-level and low-level compiler optimizations. 

In this paper, we study the energy behavior of tiling considering both the 
entire system and individual components such as datapath, caches, and main 
memory. Our results show that the energy performance of tiling is very sensitive 
to input size and tile size. In particular, selecting a suitable tile size for a given 
computation involves tradeoffs between energy and performance. We find that 
tailoring tile size to the input size generally results in lower energy consumption 
than working with a fixed tile size. Since the best tile sizes from the performance 
point of view are not necessarily the best tile sizes from the energy point of view, 
we suggest experimenting with different tile sizes to select the most suitable one 
for a given code, input size, and technology parameters. Also, given the current 
trends in memory technology, we expect that the energy increase in datapath 
due to tiling will demand challenging tradeoffs between prolonging battery life 
and limiting energy dissipated within a package. 

Abstract. Embedded systems consisting of the application program ROM, 
RAM, the embedded processor core, and any custom hardware on a single wafer 
are becoming increasingly common in application domains such as signal processing. 
Given the rapid deployment of these systems, programming on such systems 
has shifted from assembly language to high-level languages such as C, C++, 
and Java. The processors used in such systems are usually targeted toward spe- 
cific application domains, e.g., digital signal processing (DSP). As a result, these 
embedded processors include application-specific instruction sets, complex and 
irregular data paths, etc., thereby rendering code generation for these processors 
difficult. In this paper, we present new code optimization techniques for embed- 
ded fixed point DSP processors which have limited on-chip program ROM and 
include indirect addressing modes using post-increment and decrement opera- 
tions. We present a heuristic to reduce code size by taking advantage of these ad- 
dressing modes. Our solution aims at improving the offset assignment produced 
by Liao et al.’s solution. It finds a layout of variables in RAM, so that it is possible 
to subsume explicit address register manipulation instructions into other instruc- 
tions as a post-increment or post-decrement operation. Experimental results show 
the effectiveness of our solution. 



1 Introduction 

With the falling cost of microprocessors and the advent of very large scale integration, 
more and more processing power is being placed in portable electronic devices. 
Such processors (in particular, fixed-point DSPs and micro-controllers) 
can be found, for example, in audio, video, and telecommunications equipment and have 
severely limited amounts of memory for storing code and data, since the area available 
for ROM and RAM is limited. This renders the efficient use of memory area very critical. 
Since the program code resides in the on-chip ROM, the size of the code directly 
translates into silicon area and hence the cost. The minimization of code size is, therefore, 
of considerable importance, while simultaneously 
preserving high levels of performance. However, current compilers for fixed-point DSPs 
generate code that is quite inefficient with respect to code size and performance. As a 
result, most application software is hand-written or at least hand-optimized, which is a 
very time consuming task. The increase in developer productivity can therefore be 
directly linked to improvement in compiler techniques and optimizations. 

S.P. Midkiff et al. (Eds.): LCPC2000, LNCS 2017, pp. ISS-ITtTI 2001. 

Many embedded processor architectures such as the TI TMS320C25 include indirect 
addressing modes with auto-increment and auto-decrement arithmetic. This feature 
allows address arithmetic instructions to be part of other instructions. Thus, it eliminates 
the need for explicit address arithmetic instructions wherever possible, leading 
to decreased code size. The memory access pattern and the placement of variables have 
a significant impact on code size. The auto-increment and auto-decrement modes can 
be better utilized if the placement of variables is performed after code selection. This 
delayed placement of variables is referred to as offset assignment. 

This paper considers the Simple Offset Assignment (SOA) problem where there is 
just one address register. A solution to the problem assigns optimal frame-relative off- 
sets to the variables of a procedure, assuming that the target machine has a single in- 
dexing register with only the indirect, auto-increment and auto-decrement addressing 
modes. The problem is modeled as follows. A basic block is represented by an access 
sequence, which is a sequence of variables written out in the order in which they are 
accessed in the high level code. This sequence is in turn further condensed into a graph 
called the access graph with weighted undirected edges. The SOA problem is equiv- 
alent to a graph covering problem, called the Maximum Weight Path Cover (MWPC) 
problem. A solution to the MWPC problem gives a solution to the SOA problem. We 
present a new algorithm, called Incremental-SolveSOA, for the SOA problem and 
compare its performance with previous work on the topic. 

The remainder of this paper is organized as follows. We present a brief explanation 
of graphs and some additional required notation and background in Section 2. Then, in 
Section 3, we consider the problem of storage assignment, where the arithmetic permitted 
on the address register is plus or minus 1. We present our experimental results in 
Section 4. We conclude the paper with a summary in Section 5. 

2 Background 

We model the sequence of data accesses as weighted undirected graphs [7]. Each variable 
in the program corresponds to a vertex (or node) in the graph. An edge (i, j) indicates 
that variable i is accessed after j or vice-versa; the weight of an edge w(i, j) denotes 
the number of times variables i and j are accessed successively. 

Definition 1 Two paths are said to be disjoint if they do not share any vertices. 



Definition 2 A disjoint path cover (which will be referred to as just a ‘cover’) of a 
weighted graph G(V, E) is a subgraph C(V, E') of G such that, for every vertex v in 
C, deg(v) < 3 and there are no cycles in C. The edges in C may be a non-contiguous 
set. 

Definition 3 The weight of a cover C is the sum of the weights of all the edges in C. 
The cost of a cover C is the sum of the weights of all edges in G but not in C: 

\[
\mathrm{cost}(C) = \sum_{(e \in E) \wedge (e \notin C)} w(e)
\]

c = c + d + f; 
a = h + c; 
b = b + e; 
c = g - b; 
a = a - c; 

(a) 

[Panel (b): placement of variables in memory, addressed through AR0. Panel (c): assembly code.] 

Fig. 1. Code example from [5,6]. 



2.1 Motivating Example 

As mentioned earlier, many embedded processors provide register indirect addressing 
modes with auto-increment and auto-decrement arithmetic. It is possible to use these 
modes for efficient sequential access of memory and improve code density. The placement 
of variables in memory has a large impact on the exploitation of auto-increment 
and auto-decrement addressing modes, which is in turn affected by the pattern in which 
variables are accessed. If the assignment of location of variables is done after code se- 
lection, then we get the freedom of assigning locations to variables depending on the 
order in which the variables are accessed. The placement of variables in storage has a 
considerable impact on code size and performance. 

Consider the C code sequence shown in Figure 1(a), an example from [5]; let the 
placement of variables in memory be as in Figure 1(b). This assignment of variables 
to memory locations here is based on first use, i.e., as the variables are referred to in 
the high level code, the variables are placed in memory. The assembly code for this 
section of C code is shown in Figure 1(c). The first column shows the assembly code, 
the second column shows the register transfer, and the third, the contents of the address 
pointer. The instructions in bold are the explicit address pointer arithmetic instructions, 
i.e., SBAR, Subtract Address Register, and ADAR, Add Address Register. The objective 
of the solution to the SOA problem is to find the minimal address pointer arithmetic instructions 
required using proper placement of variables in memory. A brief explanation 
of Figure 1 follows. 

The first instruction LDAR AR0, &c loads the address of the first variable ‘c’ into 
the address register AR0. The next instruction LOAD (AR0)+ loads the variable ‘c’ into 
the accumulator. This instruction shows the use of the auto-increment mode. Ordinarily, 
we would need an explicit pointer increment to get it to point to ‘d’, which is the next 
required variable, but it is subsumed into the LOAD instruction in the form of a post-increment 
operation, indicated by the trailing ‘+’ sign. The pointer decrement operation 
can also be similarly subsumed by a post-decrement operation indicated by a trailing 
‘-’ sign, for example, as in the ADD *(AR0)- instruction. It can be seen that the instructions 
in bold in Figure 1 are the ones that do only address pointer arithmetic. The number 
of such instructions in the generated code may be very high, as typically the high-level 
programmer does not consider the variable layout while writing the program. AR0 is 
auto-incremented after the first LOAD instruction. Now AR0 is pointing to ‘d’; as ‘d’ is 
the next variable required, it can be accessed immediately without having to change 
AR0. The same holds for variable ‘f’. The next variable required is ‘c’, which is at a distance 
2 from ‘f’. Consider the STOR instruction that writes the result back to ‘c’: an explicit 
SBAR AR0, 2 instruction has to be used to set AR0 to point to ‘c’, because the address 
of ‘f’ and that of ‘c’ differ by two and auto-decrement cannot be used along with the 
previous ADD instruction. This can be seen in the other instances of ADAR and SBAR, 
where for every pair of accesses that do not refer to adjacent variables, either an SBAR 
or ADAR instruction must be used. In total, ten such instructions are needed to execute 
the code in Figure 1(a), given the offset assignment of Figure 1(b). 
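
The count of ten can be checked with a short Python sketch. It assumes that variables are read left to right and the result is written last for each statement, which gives the access order c d f c h c a b e b g b c a c a for the code above, and that the first-use layout places the variables in the order c, d, f, h, a, b, e, g. The function name `soa_cost` is hypothetical.

```python
def soa_cost(layout, access_sequence):
    """Count consecutive accesses whose variables are not in adjacent words.

    layout          : variables in memory order (index = offset)
    access_sequence : variables in the order they are accessed
    Each counted pair needs one explicit ADAR or SBAR instruction.
    """
    offset = {var: pos for pos, var in enumerate(layout)}
    cost = 0
    for prev, cur in zip(access_sequence, access_sequence[1:]):
        if abs(offset[prev] - offset[cur]) > 1:
            cost += 1
    return cost


first_use_layout = "cdfhabeg"
sequence = "cdfchcabebgbcaca"
explicit_instructions = soa_cost(first_use_layout, sequence)
```

For the first-use layout the function returns 10, in agreement with the count given in the text.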

2.2 Assumptions in SOA 

The simple offset assignment (SOA) problem is one of assigning a frame-relative offset 
to each local variable to minimize the number of address arithmetic instructions (ADAR 
and SBAR) required to execute a basic block. The cost of an assignment is hence defined 
by the number of such instructions. With a single address register, the initializing LDAR 
instruction is not included in this cost. We make the following assumptions for the 
SOA problem: (1) every data object is of size one word; (2) a single address register 
is used to address all variables in the basic block; (3) one-to-one mapping of variables 
to locations; (4) the basic block has a fixed evaluation order; and (5) special features
such as address wrap-around are not considered. 

2.3 Approach to the Problem 

The SOA problem can be formulated as a graph covering problem, called the Maximum
Weight Path Covering (MWPC) problem. From a basic block, a graph called the
access graph is derived; it shows the variables, their relative adjacency,
and the frequency of accesses. From the solution to the MWPC problem, a minimum cost
assignment can be constructed.



cdfchcabebgbcaca

(a)

Fig. 2. (a) Access sequence; (b) Access graph; (c) Offset assignment and cover C (thick
edges).



Given a code sequence S that represents a basic block, one can define a unique
access sequence for that block. In an operation ‘z = x op y’, where ‘op’ is some
binary operator, the access sequence is given by ‘xyz’. The access sequence for an
ordered set of operations is simply the concatenated access sequences for each operation
in the appropriate order. For example, the access sequence for the C code example in
Figure 1(a) is shown in Figure 2(a).
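
As a minimal sketch of this definition, the following Python fragment concatenates the per-statement accesses of the running example. The representation of a statement as a (destination, sources) pair and the name `access_sequence` are illustrative assumptions, not part of the original formulation.

```python
def access_sequence(statements):
    """Concatenate 'sources, then destination' for each statement, in order."""
    seq = []
    for dest, sources in statements:
        seq.extend(sources)   # operands are read left to right
        seq.append(dest)      # the result is written last
    return "".join(seq)

# The C code of the running example, as (destination, sources) pairs.
block = [
    ("c", ["c", "d", "f"]),   # c = c + d + f
    ("a", ["h", "c"]),        # a = h - c
    ("b", ["b", "e"]),        # b = b + e
    ("c", ["g", "b"]),        # c = g - b
    ("a", ["a", "c"]),        # a = a - c
]

print(access_sequence(block))  # cdfchcabebgbcaca
```

The printed string agrees with the access sequence given for the example.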

With the definition of cost given earlier, it can be seen that the cost is the number 
of consecutive accesses to variables that are not assigned to adjacent locations. The 
access sequence is the sequence of memory references made by a section of code and 
it can be obtained from the high-level language program. The access sequence can 
be summarized in an edge-weighted, undirected graph. The access graph G(V, E) is
derived from an access sequence as follows. Each vertex v ∈ V of the access graph
corresponds to a unique variable in the basic block. An edge e(u, v) ∈ E between
vertices u and v exists with weight w(e) if variables u and v are adjacent to each other
w(e) times in the access sequence. The order of the accesses is not significant, as either
auto-increment or auto-decrement can be performed. The access graph for Figure 2(a)
is shown in Figure 2(b).
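
The construction of the access graph and the cost definition can be sketched in a few lines of Python. The function names are hypothetical, and the layout `cdfhabeg` (variables in order of first use) is an assumption made for illustration; it is consistent with the cost of ten quoted for the example.

```python
from collections import Counter

def access_graph(seq):
    """Map each unordered pair of distinct variables to its adjacency count."""
    graph = Counter()
    for u, v in zip(seq, seq[1:]):
        if u != v:
            graph[frozenset((u, v))] += 1
    return graph

def assignment_cost(seq, layout):
    """Count consecutive accesses whose variables are more than one slot apart."""
    offset = {var: i for i, var in enumerate(layout)}
    return sum(abs(offset[u] - offset[v]) > 1 for u, v in zip(seq, seq[1:]))

seq = "cdfchcabebgbcaca"
graph = access_graph(seq)

print(graph[frozenset("ca")])            # 4: 'c' and 'a' are adjacent four times
print(sum(graph.values()))               # 15: total edge weight
print(assignment_cost(seq, "cdfhabeg"))  # 10
```

Because the graph is undirected, the pairs ‘ca’ and ‘ac’ contribute to the same edge.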

2.4 SOA and Maximum Weight Path Cover

Given the definitions earlier in this paper, if a maximum weight cover for an access
graph is found, then a minimum cost assignment has also been found.
Given a cover C of G, the cost of every offset assignment implied by C
is less than or equal to the cost of the cover C. Given an offset assignment A and an
access graph G, there exists a disjoint path cover which implies A and which has the same
cost as A. Every offset assignment implied by an optimal disjoint path cover is optimal.
An example of a sub-optimal path cover is shown in Figure 2(c). The thick lines show
the disjoint path cover, and the corresponding offset assignment is also shown. The cost
of this assignment is 10. This can be seen from the edges not in the cover.
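
The link between cover weight and assignment cost can be written out explicitly. As a sketch, assume that exactly the edges of the cover C join variables placed in adjacent locations by the implied assignment A; then

```latex
\mathrm{cost}(A) \;=\; \sum_{e \in E} w(e) \;-\; \sum_{e \in C} w(e)
```

For the access sequence of the example, which has 16 accesses and therefore 15 consecutive pairs of distinct variables, the total edge weight is 15, so a cover that implies a cost of 10 has weight 5. Since the first sum is fixed by the access sequence, maximizing the weight of the cover minimizes the cost.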



Fig. 3. Optimal offset assignment and cover C (thick edges).



3 An Incremental Algorithm for SOA 

3.1 Previous Work 

Bartley and Liao studied the simple offset assignment problem. Liao formulated
the problem, modeled it as a graph-theoretic optimization problem similar to
Bartley's, and showed it to be equivalent to the Maximum Weight Path Cover (MWPC)
problem, which is NP-hard. A heuristic solution proposed by Liao is explained in
the following subsection. Consider the example shown earlier. Using Liao’s algorithm,
we get the offset assignment shown in Figure 3(a), which is implied by the cover of
the access graph in Figure 3(b). The cover is shown by the heavy edges, and
in this case it is optimal. This can be seen from the graph itself: picking any of the
four non-included edges would cause some edge to be dropped from the cover, which
would in turn increase the cost of the cover. The assembly code for the offset assignment
implied by the cover is shown in Figure 4(c). The address arithmetic instructions are
highlighted; there are four such instructions, corresponding to the four edges not in
the cover. For example, because ‘a’ and ‘b’ could not be placed adjacent to each other,
we need to use the instruction ADAR *(AR0) 4. The offset assignment that this section
of code uses is shown in Figure 4(b), along with the C code (Figure 4(a)) for
reference. Leupers and Marwedel present a heuristic for choosing among different
edges with the same weight in Liao’s heuristic.

c = c + d + f;
a = h - c;
b = b + e;
c = g - b;
a = a - c;

(a)

Fig. 4. Code after the optimized offset assignment.

3.2 Liao’s Heuristic for SOA 

Because SOA and MWPC are NP-hard, no polynomial-time algorithm solves these
problems optimally unless P = NP. Liao’s heuristic for the simple
offset assignment problem is shown in Figure 5. This algorithm is similar to Kruskal’s
minimum spanning tree algorithm. The heuristic is greedy, in the sense that it
repeatedly selects the edge that seems best at the current moment.



 1  // INPUT  : Access Sequence, L
 2  // OUTPUT : Constructed Assignment E'
 3  Procedure Solve-SOA(L)
 4      G(V, E) ← AccessGraph(L)
 5      Esort ← Sorted edges in E in descending order of weight
 6      // Initialize C(V', E'), the constructed cover
 7      E' ← {}
 8      V' ← V
 9      while (|E'| < |V| - 1 and Esort not empty) do
10          e ← first edge in Esort
11          Esort ← Esort - {e}
12          if ((e does not cause a cycle in C) and
13              (e does not cause any vertex in V' to have degree > 2))
14              add e to E'
15          else
16              discard e from Esort
17          endif
18      enddo
19      return E'



Fig. 5. Liao’s maximum weight path cover heuristic.



Consider the algorithm Solve-SOA(L) in Figure 5. This algorithm takes as input a
sequence L, which uniquely represents the high-level code, and produces as output an
offset assignment. In line 4, the graph G(V, E) is produced from the access sequence L.
Producing the access sequence takes O(|L|) time. Line 5 produces a list of edges sorted
in descending order of weight. C(V', E') is the cover of the graph G, which starts with
all the vertices included but no edges. The condition of the while statement makes
sure that no more than |V| - 1 edges are selected, as that is the maximum needed for
any cover. If the cover is disjoint, the order in which the disjoint paths are positioned
does not matter as far as the cost of the offset assignment is concerned, because the
cost of moving from one path to another will always have to be paid. The complexity
of Liao’s heuristic is O(|E| log |E| + |L|), where |E| is the number of edges in the
access graph and |L| is the length of the access sequence. Construction of the access
sequence takes O(|L|) time. The O(|E| log |E|) term is due to the need to sort the edges
in descending order of weight. The main loop of the algorithm runs for |V| iterations.
The test for a cycle in line 12 takes constant time, and the total time for the main loop is
bounded by O(|E|). The test for a cycle is achieved in constant time by using a special
data structure proposed by Liao.
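
The greedy procedure can be sketched in Python as follows. This is an illustrative reading of the pseudocode, not Liao's implementation: the cycle test uses a plain union-find forest as a stand-in for Liao's constant-time data structure, ties between equal-weight edges are broken by first occurrence in the access sequence, and all function names are hypothetical.

```python
from collections import Counter

def access_graph(seq):
    """Edge weights: how often two distinct variables are adjacent in seq."""
    graph = Counter()
    for u, v in zip(seq, seq[1:]):
        if u != v:
            graph[tuple(sorted((u, v)))] += 1
    return graph

def assignment_cost(seq, layout):
    """Count consecutive accesses whose variables are more than one slot apart."""
    offset = {var: i for i, var in enumerate(layout)}
    return sum(abs(offset[u] - offset[v]) > 1 for u, v in zip(seq, seq[1:]))

def solve_soa(seq):
    """Greedy path cover; returns (cover_edges, layout)."""
    graph = access_graph(seq)
    variables = list(dict.fromkeys(seq))        # first-use order
    parent = {v: v for v in variables}          # union-find forest (cycle test)
    neighbours = {v: [] for v in variables}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]       # path halving
            v = parent[v]
        return v

    cover = []
    # sorted() is stable, so equal-weight edges keep first-occurrence order.
    for (u, v), _ in sorted(graph.items(), key=lambda item: -item[1]):
        if len(cover) == len(variables) - 1:
            break
        if len(neighbours[u]) < 2 and len(neighbours[v]) < 2 and find(u) != find(v):
            parent[find(u)] = find(v)
            neighbours[u].append(v)
            neighbours[v].append(u)
            cover.append((u, v))

    # Lay the disjoint paths end to end, each walked from one of its endpoints.
    layout, seen = [], set()
    for start in variables:
        if start in seen or len(neighbours[start]) == 2:
            continue
        prev, cur = None, start
        while cur is not None:
            layout.append(cur)
            seen.add(cur)
            onward = [n for n in neighbours[cur] if n != prev]
            prev, cur = cur, (onward[0] if onward else None)
    return cover, layout

seq = "cdfchcabebgbcaca"
cover, layout = solve_soa(seq)
print(sum(access_graph(seq)[e] for e in cover))  # 11
print(assignment_cost(seq, layout))              # 4
```

On the running example the sketch selects a cover of weight 11 out of a total weight of 15, which leaves a cost of 4 and matches the four address arithmetic instructions discussed earlier.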

3.3 Improvement over Solve-SOA

Before suggesting the improvement, we want to point out two deficiencies in
Solve-SOA. First, even though the edges are sorted in descending order of weight, the order
in which edges of the same weight are considered is not specified. We
believe this to be important in deriving the optimal solution. Second, the maximum weight
edge is always selected, since this is a greedy approach. The proposed
Incremental-Solve-SOA heuristic addresses both these cases. This algorithm takes as input an offset
sequence, produced either by Liao’s Solve-SOA or by some other means, and tries to
include edges not previously included in the cover. Consider the example of Figure 6.
The code sequence is shown in Figure 6(a) and the corresponding access sequence is
in Figure 6(b). The access graph which in turn corresponds to this access sequence is
shown in Figure 6(c). Let us now run Liao’s Solve-SOA using the access sequence in
Figure 6(b); one possible outcome is shown in Figure 7(a). The offset assignment
associated with the cover is a, d, b, c, e or d, b, c, e, a. This is clearly a non-optimal solution.
The cost of this assignment is 2. The optimal solution would be d, b, a, c, e. The
optimal cost of 1 could have been achieved by considering either edge (a, b) or edge
(a, c) before edge (b, c). But since Solve-SOA does not consider the relative
positioning of the edges of the same weight in the graph, we get the cost of 2. The solution
that is produced by the proposed Incremental-Solve-SOA is d, b, a, c, e, as shown in
Figure 7(b).



d = a + b;
e = b - c;
a = c + 2;

(a)

abdbceca

(b)

Fig. 6. An example where Solve-SOA could possibly return suboptimal results.

Fig. 7. Suboptimal and optimal cover of G.

Fig. 8. Four different offset assignments.



Incremental-Solve-SOA. Figure 9 shows the proposed algorithm. The algorithm picks
the maximum weight edge not included in the cover and tries to include it. This is
done as follows. Let the maximum weight edge not included in the cover be between
two variables a(n) and a(n+x), in that order. We consider the case where we try to
include that edge and see the effect on the cost of an assignment. There are four offset
assignments when we try to bring together two variables that were not previously adjacent. The
initial offset assignment is ... a(n-1) a(n) a(n+1) ... a(n+x-1) a(n+x) a(n+x+1) ....
We consider the following four sequences that would result when edge (a(n), a(n+x))
is included in the cover:

(1) ... a(n-1) a(n+x) a(n) a(n+1) ... a(n+x-1) a(n+x+1) ...

(2) ... a(n-1) a(n) a(n+x) a(n+1) ... a(n+x-1) a(n+x+1) ...

(3) ... a(n-1) a(n+1) ... a(n+x-1) a(n) a(n+x) a(n+x+1) ...

(4) ... a(n-1) a(n+1) ... a(n+x-1) a(n+x) a(n) a(n+x+1) ...

The cost of each of these is evaluated and the best assignment, i.e., the one with the least
cost of the four, is chosen for the next iteration. A running minimum cost assignment,
BEST, is used to store the best assignment discovered. This is returned at the end of the
procedure.

Theorem 1. Incremental-Solve-SOA will either improve on the initial assignment or
return an assignment of the same cost.

Proof: As different assignments with different costs are produced, a running
minimum is maintained. If the minimum is the initial assignment, that is the one considered
again and finally returned when all the edges are locked or there are no non-zero edges
available for inclusion. □

 1  // INPUT  : Access Sequence AS, Initial Offset Assignment OA
 2  // OUTPUT : Final Offset Assignment
 3  Procedure Incremental-Solve-SOA(AS, OA)
 4      G = (V, E) ← AccessGraph(AS)
 5      BEST ← Initial offset assignment OA
 6      repeat
 7          Esort ← Sorted list of unselected edges from BEST configuration
 8          OUTER_FLAG ← FALSE
 9          Unlock all edges in Esort
10          INNER_BEST ← BEST
11          repeat
12              INNER_FLAG ← FALSE
13              e ← topmost edge from Esort
14              (A0, ..., A3) ← The four possible assignments due to e
15              // An assignment is illegal if it involves changing a locked edge;
16              // Otherwise, an assignment is legal
17              S ← the set of legal assignments from (A0, ..., A3)
18              if (S has at least one legal assignment)
19                  INNER_FLAG ← TRUE
20                  CURRENT ← MinCost(S)
21                  lock the edges that change
22                  Delete the locked edges from Esort ensuring that Esort stays sorted
23                  if (CostOf(CURRENT) < CostOf(INNER_BEST))
24                      INNER_BEST ← CURRENT
25                  endif
26              else if (Esort ≠ ∅)
27                  INNER_FLAG ← TRUE
28              endif
29          until (INNER_FLAG ≠ TRUE)
30          if (CostOf(INNER_BEST) < CostOf(BEST))
31              BEST ← INNER_BEST
32              OUTER_FLAG ← TRUE
33          endif
34      until (OUTER_FLAG ≠ TRUE)
35      return BEST



Fig. 9. Incremental-Solve-SOA



In the example, the initial assignment is d, b, c, e, a, which has a cost of 2. Let us
try to include edge (b, a). The resulting four assignments for the initial assignment of
d, b, c, e, a with cost = 2 are:

(1) d, a, b, c, e cost = 3

(2) d, b, a, c, e cost = 1

(3) d, c, e, b, a cost = 4

(4) d, c, e, a, b cost = 4

These four assignments are shown in Figure 8(b)-(e). Figure 8(a) shows the initial offset
assignment for the purpose of comparison.
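
The four candidate layouts and their costs in this example can be reproduced with a short Python sketch. The helper names `four_candidates` and `assignment_cost` are hypothetical; the sketch assumes the two variables are not already adjacent in the layout.

```python
def assignment_cost(seq, layout):
    """Count consecutive accesses whose variables are more than one slot apart."""
    offset = {var: i for i, var in enumerate(layout)}
    return sum(abs(offset[u] - offset[v]) > 1 for u, v in zip(seq, seq[1:]))

def four_candidates(layout, u, v):
    """Return the four layouts that make u and v adjacent by moving one of them."""
    layout = list(layout)
    i, j = sorted((layout.index(u), layout.index(v)))
    first, second = layout[i], layout[j]        # first = a(n), second = a(n+x)
    rest = layout[:j] + layout[j + 1:]          # layout with a(n+x) removed
    cand1 = rest[:i] + [second] + rest[i:]              # a(n+x) just before a(n)
    cand2 = rest[:i + 1] + [second] + rest[i + 1:]      # a(n+x) just after a(n)
    rest = layout[:i] + layout[i + 1:]          # layout with a(n) removed
    cand3 = rest[:j - 1] + [first] + rest[j - 1:]       # a(n) just before a(n+x)
    cand4 = rest[:j] + [first] + rest[j:]               # a(n) just after a(n+x)
    return [cand1, cand2, cand3, cand4]

seq = "abdbceca"
candidates = four_candidates("dbcea", "b", "a")
for cand in candidates:
    print("".join(cand), assignment_cost(seq, cand))
# dabce 3
# dbace 1
# dceba 4
# dceab 4
```

The second candidate is the one of least cost and is therefore the one carried into the next iteration.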



Detailed Explanation of Incremental-Solve-SOA. The input is an access sequence
and the initial offset assignment that we will attempt to improve upon. The output is the
possibly improved offset assignment. In line 4, we call the function AccessGraph to
obtain the access graph from the access sequence. BEST is a data structure that stores
an offset assignment along with its cost. It is initialized with the input offset
assignment in line 5. Lines 6 through 34 form the outer loop. The exit condition for this loop is
OUTER_FLAG being FALSE. Line 7 produces Esort, which holds the sorted list of edges
present in the access graph but not in the cover, in decreasing order of weight. This is
done so as to be able to consider edges for inclusion in the order of decreasing weight.
Each edge carries a flag that is used for locking or unlocking the edge. An edge whose
lock has been released is available for inclusion in the cover. Lines
11 through 29 form the inner loop. The exit condition for this loop is INNER_FLAG being
FALSE. In line 13, the topmost edge is extracted from Esort, and in line 14 the four assignments
are produced as explained earlier in the discussion of the four candidate sequences. These four
assignments are stored in (A0, ..., A3). The cover formed by each is checked to see whether
a locked edge not included earlier is being included, or a locked edge included
earlier is being excluded. The assignments where this does not happen are legal and are
stored in the set S in line 17.
In line 18, if there is at least one legal assignment available, i.e., S is not empty, then
INNER_FLAG is set to TRUE in line 19 and the minimum cost assignment in S is assigned
to CURRENT in line 20. The edges which undergo transitions as explained earlier are locked in line 21.
INNER_BEST maintains a running minimum cost assignment for the inner loop, and
if the CURRENT cost is less than that of INNER_BEST, then CURRENT is made the INNER_BEST.
Esort is updated in line 22 to hold the list of unselected and unlocked edges from the CURRENT
cover. If no legal assignment could be found for the edge extracted from
Esort, and there is at least one other edge available for consideration, then
INNER_FLAG is set to TRUE. This is done in lines 26 and 27. Once there are no more
legal assignments and there are no more edges available in Esort, we exit the inner loop
and check whether the cost of INNER_BEST is less than that of the BEST found so far. If there is an
improvement, we perform the whole process of the inner loop all over again. This is made
possible by setting OUTER_FLAG to TRUE. If no improvement was found, then we
exit the outer loop too, and the BEST offset assignment discovered is returned in line
35.
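
The overall control flow can be sketched in Python as below. This is a simplified reading of Figure 9 under stated assumptions, not the authors' implementation: an edge counts as "in the cover" when its two variables are adjacent in the current layout, the list Esort is recomputed in each inner iteration instead of being updated in place, and a hypothetical `skipped` set records edges for which no legal assignment exists so that the inner loop terminates. All function names are illustrative.

```python
from collections import Counter

def access_graph(seq):
    graph = Counter()
    for u, v in zip(seq, seq[1:]):
        if u != v:
            graph[tuple(sorted((u, v)))] += 1
    return graph

def cost(seq, layout):
    offset = {var: i for i, var in enumerate(layout)}
    return sum(abs(offset[u] - offset[v]) > 1 for u, v in zip(seq, seq[1:]))

def layout_edges(layout, graph):
    """Access-graph edges whose endpoints are adjacent in the layout."""
    return {tuple(sorted(pair)) for pair in zip(layout, layout[1:])} & set(graph)

def four_candidates(layout, u, v):
    i, j = sorted((layout.index(u), layout.index(v)))
    first, second = layout[i], layout[j]
    no_second = layout[:j] + layout[j + 1:]
    no_first = layout[:i] + layout[i + 1:]
    return [no_second[:i] + [second] + no_second[i:],
            no_second[:i + 1] + [second] + no_second[i + 1:],
            no_first[:j - 1] + [first] + no_first[j - 1:],
            no_first[:j] + [first] + no_first[j:]]

def incremental_solve_soa(seq, initial):
    graph = access_graph(seq)
    best = list(initial)
    improved = True
    while improved:                                   # outer loop
        improved = False
        locked, skipped = set(), set()                # all edges start unlocked
        current = inner_best = best
        while True:                                   # inner loop
            chosen = layout_edges(current, graph)
            pending = sorted((e for e in graph if e not in (chosen | locked | skipped)),
                             key=lambda e: -graph[e])
            if not pending:
                break
            edge = pending[0]                         # heaviest unselected edge
            legal = []
            for cand in four_candidates(current, *edge):
                changed = chosen ^ layout_edges(cand, graph)
                if not (changed & locked):            # must not touch locked edges
                    legal.append((cost(seq, cand), cand, changed))
            if legal:
                _, current, changed = min(legal, key=lambda item: item[0])
                locked |= changed                     # lock the edges that changed
                if cost(seq, current) < cost(seq, inner_best):
                    inner_best = current
            else:
                skipped.add(edge)
        if cost(seq, inner_best) < cost(seq, best):
            best, improved = inner_best, True
    return best

result = incremental_solve_soa("abdbceca", "dbcea")
print("".join(result), cost("abdbceca", result))  # dbace 1
```

Because BEST is replaced only on a strict improvement, the returned assignment never costs more than the initial one, in line with Theorem 1.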

3.4 Complexity of Incremental-Solve-SOA 

As mentioned before, the running time of Liao’s Solve-SOA heuristic is O(|E| log |E| +
|L|), where |E| is the number of edges in the access graph and |L| is the length of
the access sequence. The running time of Incremental-Solve-SOA is O(|E|) for
the inner loop. Sorting the edges in descending order of weight for Esort takes
O(|E| log |E|) time, and the marking of the edges is O(|E|). In our experience, the
outer loop runs for a constant number of iterations, 2 on average.
So, the complexity of each outer loop iteration in practice is O(|E| log |E|). This is
the same as Liao’s, although in practice we incur a higher overhead. The use
of our heuristic is justified by the fact that the code produced by this optimization will
be executed many times whereas the compilation is done only once. Also, its use could
possibly result in a smaller ROM area.

4 Experimental Results 

We implemented both Solve-SOA and Incremental-Solve-SOA. The results shown
in Table 1 are for the case where the initial offset assignment used was the result of
using Liao’s heuristic. (If the initial offset assignment is an unoptimized one, the
improvements will be much higher.)



Table 1. Results from Incremental-Solve-SOA as compared to Solve-SOA.

Number of    Size of            % Cases     % Improvement in the
Variables    Access Sequence    Improved    Improved Cases
---------    ---------------    --------    --------------------
 5            10                2.4         37.08
 5            20                4.6         19.00
 8            16                4.0         17.13
 8            30                7.4          8.71
 8            40                6.0          6.85
10            50                7.8          4.87
10           100                5.4          2.41
15            20                4.8         12.66
15            30                4.4          7.46
15            56                5.4          3.29
15            75                5.4          2.37
20            40                2.8          5.38
20            75                4.0          2.71
20           100                3.4          1.92



The experiments were performed by generating random access sequences and using
these as input to Liao’s heuristic. The offset sequence returned was then used, in turn,
along with the access sequence in Incremental-Solve-SOA to produce a possible
change in the offset sequence. The result is guaranteed to be the same or better, as
reflected in Table 1. The percentage of cases improved, shown in the third column,
is the measure of relevance here, as it shows an improvement in the cost of the cover of
the access graph. It is always possible to increase all the edge weights in the access
graph by some constant value to achieve a higher magnitude of improvement for the same
change in cover, but the change in cover would still be the same.

In Table 1, the first column lists the number of variables and the second column lists
the size of the access sequence. The third column shows the percentage of cases in which
there was an improvement. For example, in the first row, there was an improvement in
2.4% of all the random access sequences considered. The fourth column
shows, for the improved cases, the extent of improvement. For the first row that would
be a 37.08% improvement in 2.4% of the cases.

The overall average improvement (in the third column) is 5.23%. This figure reflects
the cases in which Incremental-Solve-SOA was able to improve upon the cover of
the access graph given the offset assignment produced by Liao’s Solve-SOA as input.
The improvement takes significance from the many times the code would be executed,
and also from the saving of ROM area that it would bring.

5 Conclusions 

Optimal code generation is important for embedded systems in view of the limited area
available for ROM and RAM. Small reductions in code size could lead to significant
changes in chip area and hence a reduction in cost. We looked at the Simple Offset
Assignment (SOA) problem and proposed a heuristic which, given as input the offset
assignment from Liao's or some other algorithm, attempts to improve on it. It does
this by trying to include the highest weighted edges not included in the cover of the access
graph. The proposed heuristic is quite simple and intuitive. Unlike algorithms that are
used in other computer applications, it is possible to justify a higher running time
for an algorithm designed for a compiler (especially for embedded systems), as it is
run once to produce the code, which is repeatedly executed. In the case of embedded
systems, there is the added benefit of savings in ROM area, possibly reducing the cost
of the chip.

In addition, the first author’s thesis addressed two important issues: the first
is the use of commutative transformations to change the access sequence and thereby
reduce the code size; the second deals with exploiting those cases where the
post-increment or post-decrement value is allowed to be greater than one. We are currently
exploring several issues. First, we are looking at the effect of statement reordering on code
density. Second, we are evaluating the effect of variable lifetimes and static single
assignment on code density. In addition, improving code density for programs with array
accesses is an important problem.

Abstract. Irregular scientific codes experience poor cache performance
due to their memory access patterns. In this paper, we examine two
issues for locality optimizations for irregular computations. First, we
experimentally find locality optimization can improve performance for
parallel codes, but is dependent on the parallelization techniques used.
Second, we show locality optimization may be used to improve
performance even for adaptive codes. We develop a cost model which can be
employed to calculate an efficient optimization frequency; it may be
applied dynamically by instrumenting the program to measure execution time
per time-step iteration. Our results are validated through experiments
on three representative irregular scientific codes.



1 Introduction 

As scientists attempt to model more complex problems, computations with
irregular memory access patterns become increasingly important. These
computations arise in several application domains. In computational fluid dynamics
(CFD), meshes for modeling large problems are sparse to reduce memory and
computation requirements. In n-body solvers such as those arising in
molecular dynamics, data structures are by nature irregular because they model the
positions of particles and their interactions.

In modern microprocessors, memory system performance begins to dictate 
overall performance. The ability of applications to exploit locality by keeping 
references to cache becomes a major (if not the key) factor affecting perfor- 
mance. Unfortunately, irregular computations have characteristics which make 
it difficult to utilize caches efficiently. 

Consider the example in Figure 1. In the irregular code, accesses to x are
irregular, dictated by the contents of the index array idx. It is unclear whether
spatial locality exists or can be exploited by the cache. In the adaptive irregular
code, the irregular accesses to x change as the program proceeds. While the
program iterates the outer loop, the condition variable change will become true,
then the index array idx will have different values, changing the access pattern
to x in the inner loop. Changes in access patterns make locality optimizations
more difficult.

This research was supported in part by NSF CAREER Development Award 
#ASC9625531 in New Technologies, NSF CISE Institutional Infrastructure Award 
#CDA9401151, and NSF cooperative agreement ACI-9619020 with NPACI and NCSA. 

S.P. Midkiff et al. (Eds.): LCPC 2000, LNCS 2017, 2001.

© Springer-Verlag Berlin Heidelberg 2001

// Irregular
do t = 1, time
  do i = 1, M
    ... = x[idx[i]]

// Adaptive irregular
do t = 1, time
  if (change) idx[] = ...
  do i = 1, M
    ... = x[idx[i]]



Fig. 1. Regular and Irregular Applications




Fig. 2. Graph Partitioning & Lexicographic Sort 



Researchers have demonstrated that the performance of irregular programs
can be improved by applying a combination of computation and data layout
transformations on irregular computations. Results have been
promising, but a number of issues have not been examined. Our paper makes
the following contributions:

— Investigate the impact of locality optimizations on parallel programs. We 
find optimizations which yield better locality have greater impact on parallel 
performance than sequential codes, especially when parallelizing irregular 
reductions using local writes. 

— Devise a heuristic for applying locality optimizations to adaptive irregular 
codes. We find a simple cost model may be used to accurately predict a 
desirable transformation frequency. An on-the-fly algorithm can apply the 
cost model by measuring the per-iteration execution time before and after 
optimizations. 

The remainder of the paper begins with a discussion of our optimization 
framework. We then experimentally evaluate the effect of locality optimizations 
on parallel performance. Next, we evaluate the effect of adaptivity on locality 
optimizations, and provide both static and dynamic methods to apply locality 
optimizations based on a simple cost model. Finally, we conclude with a discus- 
sion of related work. 

// 1 access
global x(N)
global idx1(M)

do i = 1, M
  ... = x(idx1(i))

// 2 accesses
global x(N)
global idx1(M), idx2(M)

do i = 1, M
  ... = x(idx1(i))
  ... = x(idx2(i))



Fig. 3. Access Pattern of Irregular Computations



2 Locality Optimizations 

2.1 Data & Computation Transformations 

Irregular applications frequently suffer from high cache miss rates due to poor
memory access behavior. However, program transformations may exploit
dormant locality in the codes. The main idea behind these locality optimizations is
to change computation order and/or data layout at runtime, so that irregular
codes can access data with more temporal and spatial locality.

Figure 2 gives an example of how program transformations can improve
locality. Circles represent computations (loop iterations), squares represent data
(array elements), and arrows represent data accesses. Initially memory accesses
are irregular, but computation and/or data may be reordered to improve
temporal and spatial locality.

Fortunately, irregular scientific computations are typically composed of loops
performing reductions such as SUM and MAX, so loop iterations can be safely
reordered. Data layouts can be safely modified as well, as long as all
references to the data element are updated. Such updates are straightforward
unless pointers are used.

2.2 Framework and Application Structure 

A key decision is when computation and data reordering should be applied. For 
locality optimizations, we believe the number of distinct irregular accesses made 
by each loop iteration can be used to choose the proper optimization algorithm. 
Figure 3 shows examples of different access patterns. Codes may access either a
single or multiple distinct elements of the array x on each loop iteration. 

In the simplest case, each iteration makes only one irregular access to each
array, as in the first code of Figure 3. The NAS integer sort (IS) and sparse
matrix vector multiplication found in conjugate gradient (CG) fall into this
category. Locality can be optimally improved by sorting computations (iterations)
in memory order of array x. Sorting computations may also be virtually
achieved by sorting index array idx1.
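
A minimal Python sketch of this single-access case is given below, assuming the loop body is a reduction so that iterations may be reordered safely. The function name `reorder_single_access` and the sample data are illustrative only.

```python
def reorder_single_access(idx1):
    """Return the iteration order that visits x in increasing memory order."""
    return sorted(range(len(idx1)), key=lambda i: idx1[i])

x = [10.0, 20.0, 30.0, 40.0, 50.0]
idx1 = [4, 0, 3, 0, 2, 4]

order = reorder_single_access(idx1)
sorted_idx1 = [idx1[i] for i in order]
print(order)        # [1, 3, 4, 2, 0, 5]
print(sorted_idx1)  # [0, 0, 2, 3, 4, 4]

# A SUM reduction gives the same result in either iteration order.
total_original = sum(x[j] for j in idx1)
total_reordered = sum(x[j] for j in sorted_idx1)
print(total_original == total_reordered)  # True
```

After sorting, repeated references to the same element of x occur in consecutive iterations and neighboring elements are visited in address order.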

More often, scientific codes make two or more distinct irregular references on
each loop iteration. Such codes arise in PDE solvers traversing irregular meshes
or N-body simulations, when calculations are made for pairs of points. An
example is shown in the second code of Figure 3. Notice that since each iteration
accesses two array elements, computations can be viewed as edges connecting
data nodes, resulting in a graph as in Figure 2. Locality optimizations can then
be mapped to a graph partitioning problem. Partitioning the graph and putting
nodes in a partition close together in memory can then improve spatial and temporal
locality. Applying lexicographic sorting after partitioning captures even more
locality. Finding optimal graph partitions is NP-hard, thus several optimization
heuristics have been proposed. In our evaluation, we use two
simple traversal algorithms called Reverse Cuthill-McKee (RCM) and Consecutive
Packing (CPACK). We also use three graph partitioning algorithms called
Recursive Coordinate Bisection (RCB), multi-level graph partitioning (METIS), and low
overhead graph partitioning (GPART).
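
To make the combination of data reordering and lexicographic sorting concrete, the Python sketch below renumbers nodes in the order in which the edge list first touches them and then sorts the remapped edges. The first-touch policy is an assumption made for illustration, in the spirit of a packing heuristic; it is not claimed to reproduce any of the algorithms named above, and the function name is hypothetical.

```python
def first_touch_reorder(num_nodes, edges):
    """Renumber nodes by first use, then sort the remapped edges lexicographically."""
    new_index = {}
    for u, v in edges:
        for node in (u, v):
            if node not in new_index:
                new_index[node] = len(new_index)
    for node in range(num_nodes):                 # nodes never accessed go last
        if node not in new_index:
            new_index[node] = len(new_index)
    remapped = sorted(tuple(sorted((new_index[u], new_index[v]))) for u, v in edges)
    return new_index, remapped

edges = [(5, 2), (0, 5), (2, 3)]
new_index, remapped = first_touch_reorder(6, edges)
print(new_index)  # {5: 0, 2: 1, 0: 2, 3: 3, 1: 4, 4: 5}
print(remapped)   # [(0, 1), (0, 2), (1, 3)]
```

Every reference to a node must use the new numbering, which is why the edge list is remapped together with the data.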



3 Evaluation of Locality Optimization 

3.1 Experimental Evaluation Environment 

Our experimental results are obtained on a Sun HPC10000 which has 400 MHz
UltraSparc II processors with 16K direct-mapped L1 and 4M direct-mapped L2
caches. Our prototype compiler is based on the Stanford SUIF compiler [13]. It
can identify and parallelize irregular reductions using pthreads, but does not yet
generate inspectors like Chaos. As a result, we currently insert inspectors by
hand for both the sequential and parallel versions of each program.

We examine three irregular applications, IRREG, NBF, and moldyn. Each 
application contains an initialization section followed by the main computation 
enclosed in a sequential time step loop. Statistics and timings are collected after 
the initialization section and the first iteration of the time step loop, in order to 
more closely match steady-state execution. 

IRREG is representative of iterative partial differential equation (PDE)
solvers found in computational fluid dynamics (CFD) applications. In such codes,
unstructured meshes are used to model physical structures. NBF is a kernel
abstracted from the GROMOS molecular dynamics code. NBF computes a force
between two molecules and applies it to the velocities of both molecules. MOLDYN is
abstracted from the non-bonded force calculation in CHARMM, a key molecular
dynamics application used at NIH to model macromolecular systems.

To test the effects of locality optimizations, we chose a variety of input data
meshes. The foil and auto inputs are 3D meshes of a parafoil and a GM Saturn automobile,
respectively. The ratios of edges to nodes are between 7 and 10 for these meshes.
The mol1 and mol2 inputs are small and large 3D meshes derived from semi-uniformly
placed molecules of MOLDYN using a 1.5 angstrom cutoff radius. We applied these
meshes to IRREG, NBF, and MOLDYN to test the locality effects. All meshes are
initially sorted, so computation reordering is not required for the original codes;
it is applied only after data reordering techniques are used, since
data reordering puts computations out of order. The foil and mol1 meshes have roughly
140K nodes, and the auto and mol2 meshes have roughly 440K nodes. Their edges/nodes
ratios are 7-9.

[Bar chart: Overhead (Sun UltraSparc II). Legend: RCB, METIS, GPART, RCM. Inputs: foil and auto for IRREG; mol1 and mol2 for NBF; mol1 and mol2 for MOLDYN.]

Fig. 4. Overhead of Data Reordering (relative to 1 iteration of ORIG)



3.2 Overhead of Optimizations 

Each locality optimization has relative processing overhead. Figure 0 displays 
the costs of data reordering techniques measured relative to the execution time 
of one iteration of the time step loop. 

The overhead includes the cost to update edge structures and transform other 
related data structures to avoid the extra indirect accesses caused by the data 
reordering. The overhead also includes the cost of computation reordering. 

The least expensive data layout optimization is CPACK. RCM has almost the same
overhead as CPACK. In comparison, partitioning algorithms are quite expensive
when used for cache optimizations: the overheads of RCB and METIS are on the
order of 5-45 times higher than that of CPACK. The overhead of GPART is much
less than that of METIS and RCB, but 3-5 times higher than that of CPACK.



3.3 Parallelizing Irregular Codes 

Another issue we consider is the effect of locality optimizations on parallel execu- 
tion. We find parallel performance is also improved, but the impact is dependent 
on the parallelization strategy employed by the application. We thus first briefly 
review two parallelization strategies. 

The core of irregular scientific applications frequently consists of
reductions, associative computations (e.g., SUM, MAX) which may be reordered and
parallelized. Compilers for shared-memory multiprocessors generally parallelize
irregular reductions by having each processor compute a portion of the
reduction, storing results in a local replicated buffer. Results from all
replicated buffers are then combined with the original global data, using
synchronization to ensure mutual exclusion.

An example of the ReplicateBufs technique is shown in Figure 5. If large
replicated buffers are to be combined, the compiler can avoid serialization by
directing the run-time system to perform global accumulations in sections using
a pipelined, round-robin algorithm. ReplicateBufs works well when the
result of the reduction is a scalar value, but is less efficient when the
reduction is to an array, since the entire array is replicated and few of its
elements are effectively used.

  global x[nodes], y[nodes]
  local ybuf[nodes]

  do t =                           // time-step loop
    ybuf[] = 0                     // init local buffer
    do i = {my_edges}              // local computation
      n = idx1[i]
      m = idx2[i]
      force = f(x[m], x[n])
      ybuf[n] += force             // updates stored in
      ybuf[m] += -force            // replicated ybuf
    reduce_sum(y, ybuf)            // combine buffers

Fig. 5. ReplicateBufs Example

  global x[nodes], y[nodes]

  inspect(idx1, idx2)              // calc local_edges/cut_edges
  do t =                           // time-step loop
    do i = {local_edges}           // both LHS's are local
      n = idx1[i]
      m = idx2[i]
      force = f(x[m], x[n])
      y[n] += force
      y[m] += -force
    do i = {cut_edges}             // one LHS is local
      n = idx1[i]
      m = idx2[i]
      force = f(x[m], x[n])        // replicated compute
      if (own(y[n])) y[n] += force
      if (own(y[m])) y[m] += -force

Fig. 6. LocalWrite Inspector Example
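As an illustration only, the ReplicateBufs scheme of Fig. 5 can be mimicked in a few lines of sequential Python. This is a minimal sketch, not the compiler-generated code: the processors are simulated by a loop, the block assignment of edges to processors is an assumption, and the pairwise function `force` is a hypothetical stand-in for f.

```python
def force(xm, xn):
    # Hypothetical pairwise interaction standing in for f(x[m], x[n]).
    return xm - xn


def replicate_bufs(x, idx1, idx2, nprocs):
    """One time step of an irregular reduction using replicated buffers."""
    nodes, edges = len(x), len(idx1)
    chunk = (edges + nprocs - 1) // nprocs   # assumed block distribution of edges
    bufs = []
    for p in range(nprocs):                  # each pass simulates one processor
        ybuf = [0.0] * nodes                 # full-size private copy of y
        for i in range(p * chunk, min((p + 1) * chunk, edges)):
            n, m = idx1[i], idx2[i]
            f = force(x[m], x[n])
            ybuf[n] += f
            ybuf[m] -= f
        bufs.append(ybuf)
    y = [0.0] * nodes
    for ybuf in bufs:                        # reduce_sum: combine all buffers
        for k in range(nodes):
            y[k] += ybuf[k]
    return y
```

Note that every simulated processor allocates a buffer of `nodes` elements even though it touches only the endpoints of its own edges, which is the inefficiency described above for array reductions.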

An alternative method is LocalWrite, a compiler and run-time technique
for parallelizing irregular reductions we previously developed. LocalWrite
avoids the overhead of replicated buffers and mutual exclusion during global
accumulation by partitioning computation so that each processor only computes
new values for locally-owned data. It simply applies to irregular computations
the owner-computes rule used in distributed-memory compilers. LocalWrite
is implemented by having the compiler insert inspectors to ensure each processor
only executes loop iterations which write to the local portion of each variable.
Values of index arrays are examined at run time to build a list of loop
iterations which modify local data.

An example of LocalWrite is shown in Figure 6. Computation may be
replicated whenever a loop iteration assigns the result of a computation to data
belonging to multiple processors (a cut edge). The overhead for LocalWrite
should be much less than that of classic inspector/executors, because the
LocalWrite inspector does not build communication schedules or perform address
translation. Moreover, LocalWrite does not perform global accumulation for the
non-local data. Instead, LocalWrite replicates computation, avoiding expensive
communication across processors.
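The inspector and executor of Fig. 6 can be sketched in sequential Python as follows. This is a hedged illustration under stated assumptions: nodes are assigned to processors in contiguous blocks (the `owner` function is hypothetical), processors are simulated by a loop, and `force` is a stand-in for f.

```python
def force(xm, xn):
    # Hypothetical pairwise interaction standing in for f(x[m], x[n]).
    return xm - xn


def owner(node, nodes, nprocs):
    # Assumed block distribution of nodes over processors.
    block = (nodes + nprocs - 1) // nprocs
    return node // block


def inspect(idx1, idx2, nodes, nprocs):
    """Classify each edge as local to one processor or cut between two."""
    local_edges = [[] for _ in range(nprocs)]
    cut_edges = [[] for _ in range(nprocs)]
    for i, (n, m) in enumerate(zip(idx1, idx2)):
        pn, pm = owner(n, nodes, nprocs), owner(m, nodes, nprocs)
        if pn == pm:
            local_edges[pn].append(i)
        else:                       # both owners will execute this iteration
            cut_edges[pn].append(i)
            cut_edges[pm].append(i)
    return local_edges, cut_edges


def local_write_step(x, y, idx1, idx2, nprocs, local_edges, cut_edges):
    """One time step; each simulated processor writes only data it owns."""
    nodes = len(x)
    for p in range(nprocs):
        for i in local_edges[p]:    # both updated elements are local
            n, m = idx1[i], idx2[i]
            f = force(x[m], x[n])
            y[n] += f
            y[m] -= f
        for i in cut_edges[p]:      # replicated compute, one local write
            n, m = idx1[i], idx2[i]
            f = force(x[m], x[n])
            if owner(n, nodes, nprocs) == p:
                y[n] += f
            if owner(m, nodes, nprocs) == p:
                y[m] -= f
    return y
```

The inspector runs once (or whenever the index arrays change), and no buffer combination step is needed because no processor ever writes an element it does not own.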
Fig. 7. Speedups on HPC 10000 (parallelized with ReplicateBufs)



The LocalWrite algorithm inspired our techniques for improving cache
locality for irregular computations. Conventional compiler analysis cannot
analyze, much less improve, the locality of irregular codes because the memory
access patterns are unknown at compile time. The lightweight inspector in
LocalWrite, however, can reorder the computations at run time to enforce local
writes. It is only a small modification to change the inspector to reorder the
computations for cache locality as well as local writes. We can use all of the
existing compiler analysis for identifying irregular accesses and reductions
(to ensure reordering is legal).

Though targeting parallel programs, our locality optimizations preprocess 
data sequentially. If overhead is too high, parallel graph partitioning algorithms 
may be used to reduce overhead for parallel programs. 



3.4 Impact on Parallel Performance 

Figure 7 and Figure 8 display 8-processor Sun HPC10000 speedups for each
mesh, calculated versus the original, unoptimized program. Figure 7 shows
speedups when applications are parallelized using ReplicateBufs, and Figure 8
shows speedups when LocalWrite is used. We include overheads to show how
performance changes as the total number of iterations executed by the program
increases. We see that when a sufficient number of iterations is executed, the
versions optimized for locality achieve better performance. Locality
optimizations thus also improve performance of parallel versions of each
program.

Fig. 8. Speedups on HPC 10000 (parallelized with LocalWrite)

In addition, we found that with locality optimizations, programs parallelized 
using LocalWrite achieved much better speedups than the original programs 
using ReplicateBufs. The LocalWrite algorithm tends to be more effective 
with larger graphs where duplicated computations are relatively few compared 
to computations performed locally. In general, the LocalWrite algorithm ben- 
efited more from the enhanced locality. Intuitively these results make sense, since 
the LocalWrite optimization can avoid replicated computation and commu- 
nication better when the mesh displays greater locality. 

Results show higher quality partitions become more important for parallel
codes. Optimizations such as CPACK and RCM, which yield almost the same
improvements as RCB and METIS for uniprocessors, perform substantially worse
on parallel codes. In comparison, GPART achieves roughly the same performance
as the more expensive partitioning algorithms, even for parallel codes.
Fig. 9. Performance Changes in Adaptive Computations (overhead excluded)



4 Optimizing Adaptive Computations 

A problem confronting locality optimizations for irregular codes is that many 
such applications are adaptive, where the data access pattern may change over 
time as the computation adapts to data. For instance, the example in Figure^is 
adaptive, since condition change may be satisfied on some iterations of the time- 
step loop, modifying elements of the index arrays idx. Fortunately, changing data 
access patterns reduces locality and degrades performance, but does not affect 
legality. Locality optimizations are not reapplied after every change, but only 
when it is deemed profitable. 



4.1 Impact of Adaptation 

To evaluate the effect of adaptivity on performance after computation/data
reordering, we timed MOLDYN with input mol1, randomly swapping 10% of the
molecules every 20 iterations to create an adaptive code. We realize
this is not realistic, but it provides a testbed for our experiments. Adaptive
behavior in MOLDYN is actually dependent on the initial position and velocity
of molecules in the input data. Future research will need to conduct a more
comprehensive study of adaptive behavior in scientific programs.

Results for our synthetic adaptive code are shown in Figure 9. In the first
graph, the x-axis marks the passage of time in the computation, while the
y-axis measures execution time. ORIG is the original program. RCB, METIS, GPART,
and CPACK represent versions of the program where data and computation are
reordered for improved locality exactly once at the beginning of the program. In
comparison, RCB-a, METIS-a, GPART-a, and CPACK-a represent the versions
where data and computation are reordered whenever access patterns change.
Data points represent the execution times for every 20 iterations of the
program. Results show that without reapplying locality reordering, performance
degrades and eventually matches that of the unoptimized program. In comparison,
reapplying partitioning algorithms after every access pattern change can
preserve the performance benefits of locality, if overhead is excluded.

In practice, however, we do need to take overhead into account. By
periodically applying reordering, we can maximize the net benefit of locality
reordering. Performance changes with periodic reordering are shown in the second
graph in Figure 9. We re-apply CPACK every 40 iterations, and RCB and GPART
every 60 iterations. Results show performance begins to degrade, but recovers
after locality optimizations are reapplied. Even after overheads are included,
the overall execution time is improved over not reapplying locality
optimizations. The key question is what frequency of reapplying optimizations
results in maximum net benefit. In the next section we attempt to calculate the
frequency and benefit of reordering.

  a : time/iteration for original code
  b : time/iteration for optimized code right after transformation
  n : number of transformations applied during t iterations
  m : tangent (slope) for optimized code
  t : total iterations
  r : percentage of nodes changed

Fig. 10. Analytical Model for Adaptive Computations

Fig. 11. Calculating Net Gain and Adaptation Frequency



4.2 Cost Model for Optimizations 

To guide locality transformations for adaptive codes, we need a cost model to
predict the benefits of applying optimizations. We begin by showing how to
calculate costs when all parameters such as optimization overhead, optimization
benefit, access change frequency, and access change magnitude are known. Later
we show how this method may be adapted to work in practice by collecting
and fixing some parameters and gathering the remaining information on-the-fly
through dynamic instrumentation.

Fig. 12. Experimental Verification of Cost Model (vertical bars represent the
numbers chosen by the cost model)

First we present a simple cost model for computing the benefits of locality 
reordering when all information is available. We can use it to both predict im- 
provements for different optimizations and decide how frequently locality trans- 
formations should be reapplied. 

We consider models for the performance of two versions of a program, original
code and optimized code. Figure 10 plots execution time per iteration, excluding
the overhead. The upper straight line corresponds to the original code and the 
lower saw-tooth line corresponds to the optimized code. For the original code, 
we assume randomly initialized input data, which makes execution times per 
iteration stay constant after node changes. For the optimized code, locality opti- 
mization is performed at the beginning and periodically reapplied. For simplicity, 
we assume execution times per iteration increase linearly as the nodes change 
and periodically drop to the lowest (optimized) point after reordering is reap- 
plied. Execution times per iteration do not include the overhead of the locality 
reordering, but we will take it into account later in our benefit calculation. 
  do t = 1, n                  // time-step loop
    if (change)                // change accesses
      idx1[] =
      access_changed()         // note access changes
    cost_model()               // apply cost model
    do i = 1, edges            // main computation

Fig. 13. Applying Cost Model Dynamically



The performance degradation rate (m) of the optimized code is set to r(a - b),
where r is the percentage of changed nodes. For example, if an adaptive code
randomly changes 10% of nodes each iteration (r = 0.1), then the execution
time per iteration becomes that of the original code after 10 iterations. The
net performance gain (G(n)) from periodic locality transformations can be
calculated as in Figure 11. Since the area below a line is the total execution
time, the net performance gain is the area between the two lines (A) minus the
total overhead (nOy). Taking the derivative of G(n), we can calculate the
point where the net gain is maximized. In practice, data access patterns often
change periodically instead of after every iteration, since scientists often
accept less precision for better execution times. We can still directly apply
our model in such cases, assuming one iteration in our model corresponds to a
set of loop iterations where data access patterns change only on the last
iteration.
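The maximization step can be worked out explicitly under the linear sawtooth model. The derivation below is a reconstruction from the definitions of a, b, m, n, t, and r, not a copy of the formulas in Figure 11. It treats iterations as continuous, writes O for the overhead of a single reordering (our notation), and assumes the interval between reorderings, p = t/n, is short enough that the optimized code never becomes slower than the original (p <= 1/r).

```latex
\begin{align*}
m &= r\,(a-b), \qquad p = \frac{t}{n} \\
A(n) &= n \int_{0}^{p} \bigl[(a-b) - m\,\tau\bigr]\, d\tau
      = (a-b)\,t - \frac{m\,t^{2}}{2n} \\
G(n) &= A(n) - n\,O \\
\frac{dG}{dn} &= \frac{m\,t^{2}}{2n^{2}} - O = 0
  \;\Longrightarrow\;
  n^{*} = t\,\sqrt{\frac{m}{2\,O}}, \qquad
  p^{*} = \frac{t}{n^{*}} = \sqrt{\frac{2\,O}{m}}
\end{align*}
```

Under these assumptions the best interval p* depends only on the ratio of reordering overhead to degradation rate, so a cheaper reordering or a faster-degrading code both call for more frequent reapplication.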

To verify our cost model, we ran experiments with three kernels on a DEC
Alpha 21064 processor. The kernels iterate 240 time steps, randomly swapping
10% of nodes every 20 iterations. We will use these adaptive programs for all
the remaining experiments for adaptive codes. Results are shown in Figure 12.
The y-axis represents the percentage of net performance gain over original
execution time. The x-axis represents the number of transformations applied
throughout the 240 time steps. Different curves correspond to different locality
reorderings, applied a varying number of times. The vertical bars represent the
numbers of transformations chosen by our cost model and the percentages under
the bars represent the predicted gain. The optimization frequency calculated by
the cost model is not an integer and needs to be rounded to an integer in
practice. Nonetheless, results show our cost model predicts performance gains
precisely, and always selects nearly the maximal point of each measured curve.

4.3 On-the-Fly Application of Cost Model 

Though experiments show the cost model introduced in the previous section is 
very precise, it requires several parameters to calculate the frequency and net 
benefit of locality optimizations. In practice, some of the information needed 
is either expensive to gather (e.g., overhead for each optimization) or nearly 
impossible to predict ahead of time (e.g., rate of change in data access patterns). 

In this section, we show how a simplified cost model can be applied by gath- 
ering information on-the-fly by inserting a few timing routines in the program. 
First, we apply only GPART as locality reordering. Thus, we only calculate the 
frequency of reordering, not the expected gain. Since GPART has low overhead 
                       IRREG           NBF             MOLDYN
                       foil    auto    mol1    mol2    mol1    mol2
  Measured Gain (%)    20.7    36.0    27.1    41.5    22.1    33.9

  Number of Transformations: 4, 4, 4, 4, 6

Table 1. Performance Improvement for Adaptive Codes with On-the-fly Cost
Model



but yields high quality ordering, it should be suitable for adaptive computations. 
Second, performance degradation rate (m) is gathered at run time (instead of 
calculating from the percentage of node changes) and adjusted by monitoring 
performance (execution time) of every iteration. The new rate is chosen so that 
actual elapsed time since the last reordering is the same as calculated elapsed 
time under the newly chosen linear rate. 

Figure 13 shows an example code that uses the on-the-fly cost model. At every
iteration, cost_model() is called to calculate a frequency based on the
information monitored so far. In our current setup, the cost model does not
monitor the first iteration to avoid noisy information from cold start. On the
second iteration, it applies GPART and monitors the overhead and the optimized
performance (b). In the third iteration, it does not apply reordering and
monitors the performance degradation. The cost model now has accumulated enough
information to calculate the desired frequency of reapplying locality
optimizations. From this point on, the cost model keeps monitoring the
performance of each iteration, adjusting the degradation rate and producing new
frequencies based on accumulated information. When the desired interval is
reached, it re-applies GPART.
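The bookkeeping described above can be sketched as a small Python class. This is an illustrative reconstruction, not the authors' implementation: the class and method names are hypothetical, timings are passed in by the caller rather than measured, and the interval formula sqrt(2 * overhead / m) is our assumption, obtained by maximizing net gain when time per iteration grows linearly at rate m. The rate itself is fitted as the text describes, so that the modeled elapsed time since the last reordering, k*b + m*k*(k-1)/2, equals the measured one.

```python
import math


class OnTheFlyCostModel:
    """Hypothetical sketch of an on-the-fly reordering cost model."""

    def __init__(self):
        self.overhead = None   # measured cost of the last reordering
        self.b = None          # time of the first iteration after reordering
        self.k = 0             # iterations since the last reordering
        self.elapsed = 0.0     # measured time since the last reordering

    def record_reorder(self, overhead_seconds):
        # Call right after applying the reordering (e.g. GPART).
        self.overhead = overhead_seconds
        self.b, self.k, self.elapsed = None, 0, 0.0

    def record_iteration(self, seconds):
        # Call once per time step with its measured execution time.
        if self.b is None:
            self.b = seconds
        self.k += 1
        self.elapsed += seconds

    def degradation_rate(self):
        # Solve elapsed = k*b + m*k*(k-1)/2 for the linear rate m.
        if self.b is None or self.k < 2:
            return None
        return 2.0 * (self.elapsed - self.k * self.b) / (self.k * (self.k - 1))

    def interval(self):
        # Assumed optimum for a linear sawtooth: sqrt(2 * overhead / m).
        m = self.degradation_rate()
        if self.overhead is None or m is None or m <= 0.0:
            return None
        return math.sqrt(2.0 * self.overhead / m)

    def should_reorder(self):
        p = self.interval()
        return p is not None and self.k >= p
```

For instance, after a reordering that cost 5 time units, feeding iteration times 1.0, 1.1, 1.2, and 1.3 yields a fitted rate of about 0.1 per iteration and an interval of about 10 iterations, so `should_reorder()` stays false until roughly the tenth iteration.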

Since irregular scientific applications may change mesh connections
periodically (e.g., every 20 iterations), one optimization for the on-the-fly
cost model is detecting periodic access pattern changes. Rounding up the
predicted optimization frequency so transformations are performed immediately
after the next access pattern change can then fully exploit the benefit of
locality reordering. For this purpose, the compiler can insert a call to
access_changed() as shown in Figure 13. This function call notifies the cost
model when the access pattern changes. The cost model then decides whether
access patterns change periodically.
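A minimal sketch of this period detection and rounding, assuming the cost model simply records the iteration numbers at which access_changed() was called; both function names below are hypothetical and the equal-gap test is one simple choice among many.

```python
import math


def detect_period(change_iterations):
    """Return the common gap between recorded access changes, or None."""
    gaps = [b - a for a, b in zip(change_iterations, change_iterations[1:])]
    if gaps and all(g == gaps[0] for g in gaps):
        return gaps[0]
    return None


def align_interval(predicted_interval, change_period):
    """Round a predicted reordering interval up to a multiple of the period."""
    if change_period <= 0:
        raise ValueError("change_period must be positive")
    return max(1, math.ceil(predicted_interval / change_period)) * change_period
```

With changes recorded at iterations 20, 40, and 60, `detect_period` returns 20, and a predicted interval of 47.3 iterations is rounded up to 60, so the reordering runs right after the third access pattern change rather than between changes.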

To evaluate the precision of our on-the-fly cost model, we performed
experiments with the three kernels under the same conditions. Table 1 shows the
performance improvements over original programs that do not have locality
reordering. The improvements closely match the maximum improvements in
Figure 12, and the numbers of transformations also match the numbers in
periodic reordering. Our on-the-fly cost model thus works well in practice with
limited information.



5 Related Work 

Researchers have investigated improving locality for irregular scientific
applications. Das et al. investigated data/computation reordering for
unstructured Euler solvers. They combined data reordering using RCM with
lexicographical sort for computation reordering to improve performance of
parallel codes on an Intel iPSC/860. Al-Furaih and Ranka studied partitioning
data using METIS and BFS to reorder data in irregular codes. They conclude METIS
yields better locality, but they did not include computation reordering.

Ding and Kennedy explored applying dynamic copying (packing) of data
elements based on loop traversal order, and show major improvements in
performance. They were able to automate most of their transformations in a
compiler, using user-provided information. For adaptive codes they re-apply
transformations after every change. In comparison, we show partitioning
algorithms can yield better locality, albeit with higher processing overhead. We
also develop a more sophisticated on-the-fly algorithm for deciding when to
re-apply transformations for adaptive codes. Ding and Kennedy also developed
algorithms for reorganizing single arrays into multi-dimensional arrays
depending on their access patterns [5]. Their technique might be useful for
Fortran codes where data are often single-dimension arrays, not structures as in
C codes.

Mellor-Crummey et al. used a geometric partitioning algorithm based on
space-filling curves to map multidimensional data to memory. In comparison,
our graph-based partitioning techniques are more suited for compilers, since
the geometric coordinate information needed for space-filling curves does not
need to be provided manually. When coordinate information is available, using
RCB is better because space-filling curves cannot guarantee evenly balanced
partitions when data is unevenly distributed, which may cause significant
performance degradation in parallel execution.

Mitchell et al. improved locality using bucket sorting to reorder loop
iterations in irregular computations. They improved the performance of two NAS
applications (CG and IS) and a medical heart simulation. Bucket sorting works
only for computations containing a single irregular access per loop iteration.
In comparison, we investigate more complex cases where two or more irregular
access patterns exist. For simple codes, lexicographic sort yields improvements
similar to bucket sorting.

Researchers have previously proposed on-the-fly algorithms for improving
load balance in parallel systems, but we are not aware of any such algorithm
for guiding locality optimizations. Bull uses a dynamic system to improve load
balance in loop scheduling [2]. Based on the previous execution time of each
parallel loop iteration, roughly equal amounts of computation are assigned.

Nicol and Saltz investigated algorithms to remap data and computation for
adaptive parallel codes to reduce load imbalance. They use a dynamic heuristic
which monitors load imbalance on each iteration, predicting time lost to load
imbalance. Data and computation are remapped using a greedy stop-on-rise
heuristic, when a local minimum is reached in predicted benefit. We adopt a
similar approach for our on-the-fly optimization technique, but use it to guide
transformations to improve locality instead of load balance.



6 Conclusions 

In this paper, we propose a framework that guides how to apply locality
optimizations according to the application access pattern. We find locality
optimizations also improve performance for parallel codes, especially when
combined with parallelization techniques which benefit from locality. We show
locality optimizations may be used to improve performance even for adaptive
codes, using an on-the-fly cost model to select for each optimization how often
to reorder data.

As processors speed up relative to memory systems, locality optimizations for 
irregular scientific computations should increase in importance, since processing 
costs go down while memory costs increase. For very large graphs, we should 
also obtain benefits by reducing TLB misses and paging in the virtual memory 
system. By improving compiler support for irregular codes, we are contribut- 
ing to our long-term goal: making it easier for scientists and engineers to take 
advantage of the benefits of high-performance computing. 
Abstract. This paper proposes a simple and efficient implementation
method for a hierarchical coarse grain task parallel processing scheme on
an SMP machine. The OSCAR multigrain parallelizing compiler automatically
generates parallelized code including OpenMP directives, and its
performance is evaluated on a commercial SMP machine. Coarse grain task
parallel processing is important for improving the effective performance of
a wide range of multiprocessor systems, from a single chip multiprocessor
to a high performance computer, beyond the limit of loop parallelism.
The proposed scheme decomposes a Fortran program into coarse grain
tasks, analyzes parallelism among tasks by “Earliest Executable
Condition Analysis” considering control and data dependencies, statically
schedules the coarse grain tasks to threads or generates dynamic task
scheduling codes to assign the tasks to threads, and generates OpenMP
Fortran source code for an SMP machine. The thread parallel code using
OpenMP generated by the OSCAR compiler forks threads only once at the
beginning of the program and joins only once at the end, even though the
program is processed in parallel based on the hierarchical coarse grain task
parallel processing concept. The performance of the scheme is evaluated
on an 8-processor SMP machine, the IBM RS6000 SP 604e High Node, using a
newly developed OpenMP backend of the OSCAR multigrain compiler. The
evaluation shows that the OSCAR compiler with the IBM XL Fortran compiler
version 5.1 gives 1.5 to 3 times larger speedup than the native XL
Fortran compiler for SPEC 95fp SWIM, TOMCATV, HYDRO2D, MGRID
and Perfect Benchmarks ARC2D.



1 Introduction 

The loop parallelization techniques, such as Do-all and Do-across, have been
widely used in Fortran parallelizing compilers for multiprocessor systems.
Currently, many types of Do-loop can be parallelized with various data
dependency analysis techniques such as GCD, Banerjee's inexact and exact
tests, OMEGA test, symbolic analysis, semantic analysis and dynamic
dependence test, and program restructuring techniques such as array
privatization [7], loop distribution, loop fusion, strip mining and loop
interchange.

S.P. Midkiff et al. (Eds.): LCPC 2000, LNCS 2017, pp. 189-EJ7] 2001. 

(c) Springer-Verlag Berlin Heidelberg 2001 
For example, Polaris compiler exploits loop parallelism by using
inline expansion of subroutines, symbolic propagation, array privatization [7]
and run-time data dependence analysis. SUIF compiler parallelizes loops
by using inter-procedure analysis, unimodular transformation and
data locality optimization. Effective optimization of data localization is
getting more important because of the increasing disparity between memory
and processor speeds. Currently, much research on data locality optimization
using program restructuring techniques such as blocking, tiling, padding and
data localization is in progress for high performance computing and single
chip multiprocessor systems.

However, these compilers cannot parallelize loops that include complex
loop-carried dependences and conditional branches to the outside of a loop.

Considering these facts, the coarse grain task parallelism should be exploited 
to improve the effective performance of multiprocessor systems further in addi- 
tion to the improvement of data dependence analysis, speculative execution and 
so on. 

PROMIS compiler [22] hierarchically combines Parafrase2 compiler [23]
using HTG and symbolic analysis techniques and EVE compiler for fine
grain parallel processing. NANOS compiler based on Parafrase2 has been
trying to exploit multi-level parallelism including the coarse grain parallelism
by using extended OpenMP API. OSCAR compiler has realized a multigrain
parallel processing that effectively combines the coarse grain task parallel
processing, which can be applied from a single chip multiprocessor to HPC
multiprocessor systems, the loop parallelization and near fine grain parallel
processing. In OSCAR compiler, coarse grain tasks are dynamically scheduled
onto processors or processor clusters by a dynamic scheduling routine generated
by the compiler, to cope with the runtime uncertainties caused by conditional
branches. As the embedded dynamic task scheduler, the centralized dynamic
scheduler in OSCAR Fortran compiler and the distributed dynamic scheduler have
been proposed.

A coarse grain task assigned to a processor cluster is processed in parallel by 
processors inside the processor cluster with the use of the loop, the coarse grain 
and near fine grain parallel processing hierarchically. 

This paper describes the implementation scheme of coarse grain task
parallel processing on a commercially available SMP machine and its performance.
Ordinary sequential Fortran programs are parallelized automatically by OSCAR
compiler, and a parallelized program with OpenMP API is generated.
In other words, OSCAR Fortran Compiler is used as a preprocessor which
transforms a Fortran program into a parallelized OpenMP Fortran program,
realizing static scheduling and centralized and distributed dynamic scheduling
for coarse grain tasks depending on the parallelism of the source program and
performance parameters of the target machines. Parallel threads are forked only
once at the beginning of the program and joined only once at the end to minimize
fork/join overhead. Though OpenMP API is chosen as the thread creation method
because of its portability, the proposed implementation scheme can be used with
other thread generation methods as well.

The performance of the proposed coarse grain task parallel processing in 
OSCAR multigrain compiler is evaluated on IBM RS6000 SP 604e High Node 8 
processors SMP machine. 

In the evaluation, OSCAR multigrain compiler automatically generates coarse 
grain parallel processing codes using a subset of OpenMP directives supported 
by IBM XL Fortran version 5.1. The codes are compiled by XL Fortran and 
executed on 8 processors of RS6000 SP 604e High Node. 

The rest of this paper is organized as follows. Section 2 introduces the
coarse grain task parallel processing scheme. Section 3 shows the implementation
method of the coarse grain task parallelization on an SMP. Section 4 evaluates
the performance of this method on IBM RS6000 SP 604e High Node for several
programs from the Perfect Benchmarks and SPEC 95fp Benchmarks.

2 Coarse Grain Task Parallel Processing 

Coarse grain task parallel processing uses parallelism among three kinds of 
macro-tasks, namely, Basic Block(BB), Repetition Block(RB or loop) and Sub- 
routine Block(SB). Macro-tasks are generated by decomposition of a source pro- 
gram and assigned to processor clusters or processor elements and executed in 
parallel inter and/or intra processor clusters. 

The coarse grain task parallel processing scheme in OSCAR multigrain au- 
tomatic parallelizing compiler consists of the following steps. 

1. Generation of macro-tasks from a source code of an ordinary sequential pro- 
gram 

2. Generation of Macro-Flow Graph which represents a result of data depen- 
dency and control flow analysis among macro-tasks. 

3. Generation of Macro-Task Graph by analysis of parallelism among
macro-tasks using Earliest Executable Condition analysis.

4. If a macro-task graph has only data dependency edges, macro-tasks are
assigned to processor clusters or processor elements by static scheduling.
If a macro-task graph has both data dependency and control dependency
edges, macro-tasks are assigned to processor clusters or processor elements
at runtime by a dynamic scheduling routine generated and embedded into the
parallelized user code by the compiler.

In the following, these steps are briefly explained. 



2.1 Generation of Macro-tasks 

In the coarse grain task parallel processing, a source program is decomposed into 
three kinds of macro-tasks, namely, Basic Block(BB), Repetition Block(RB) and 
Subroutine Block (SB) as mentioned above. 
Fig. 1. Macro flow graph and macro-task graph



If there is a parallelizable loop, it is decomposed into smaller loops in the
iteration direction and the decomposed partial loops are defined as different
macro-tasks. The number of decomposed loops is decided considering the number
of processor clusters or processor elements and cache or memory size.
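The decomposition in the iteration direction can be illustrated with a small Python helper. This is a sketch under simplifying assumptions: the function name is hypothetical, the number of partial loops is passed in directly rather than derived from processor counts and cache or memory size, and the loop has unit stride with inclusive bounds as in a Fortran DO loop.

```python
def decompose_loop(lower, upper, parts):
    """Split the inclusive range [lower, upper] into contiguous partial loops."""
    total = upper - lower + 1
    if total <= 0 or parts <= 0:
        return []
    parts = min(parts, total)            # never create empty partial loops
    base, extra = divmod(total, parts)   # first `extra` chunks get one more
    chunks, start = [], lower
    for p in range(parts):
        size = base + (1 if p < extra else 0)
        chunks.append((start, start + size - 1))
        start += size
    return chunks
```

For example, `decompose_loop(1, 10, 4)` yields the bounds (1, 3), (4, 6), (7, 8), (9, 10), each of which would be treated as a separate macro-task.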

RBs composed of sequential loops having large processing cost and SBs, to
which inline expansion cannot be applied effectively, are decomposed into sub
macro-tasks and the hierarchical coarse grain task parallel processing is
applied, as shown in a figure explained later.



2.2 Generation of Macro-flow Graph 

Next, the date dependency and control flow among macro-tasks for each nest 
level are analyzed hierarchically. The control flow and data dependency among 
macro-tasks are represented by macro-flow graph as shown in FigEKa). 

In the figure, nodes represent macro-tasks, solid edges represent data depen- 
dencies among macro-tasks and dotted edges represent control flow. A small cir- 
cle inside a node represents a conditional branch inside the macro-task. Though 
arrows of edges are omitted in the macro-flow graph, it is assumed that the
directions are downward. 



2.3 Generation of Macro-task Graph 

Though the generated macro-flow graph represents data dependencies and
control flow, it does not represent parallelism among macro-tasks. To extract
parallelism among macro-tasks from the macro-flow graph, Earliest Executable
Condition analysis, which considers data dependencies and control dependencies,
is applied. The Earliest Executable Condition represents the conditions under
which a macro-task may begin its execution earliest. It is obtained assuming the
following conditions.

1. If Macro-Task(MT)i data-depends on MTj, MTi cannot begin execution
before MTj finishes execution.

2. If the branch direction of MTj is determined, MTi that control-depends on 
MTj can begin execution even though MTj has not completed its execution. 

Then, the original form of the Earliest Executable Condition is represented as
follows:

(Macro-Task(MT)j, on which MTi is control dependent,
takes a branch that guarantees MTi will execute)

AND

(MTk (0 ≤ k ≤ |N|), on which MTi is data dependent, completes execution
OR it is determined that MTk will not be executed),
where N is the number of predecessors of MTi.

For example, the original form of the Earliest Executable Condition of MT6 in
Fig. 1(b) is

(MT1 takes a branch that guarantees MT3 will be executed
OR MT2 takes a branch that guarantees MT4 will be executed)

AND

(MT3 completes execution
OR MT1 takes a branch that guarantees MT4 will be executed).

However, the completion of MT3 means MT1 already took the branch to
MT3. Also, “MT2 takes a branch that guarantees MT4 will execute” means
that MT1 already branched to MT2. Therefore, this condition is redundant and
its simplest form is

(MT3 completes execution
OR MT2 takes a branch that guarantees MT4 will execute).
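
The simplification above can be checked mechanically with a small sketch. The figure is not reproduced here, so the branch structure is an assumption: MT1 branches to either MT2 or MT3, MT2 then branches either toward MT4 or elsewhere, and the second clause of the data part is read as "MT1 branched away from MT3, toward MT2". All function names are hypothetical.

```python
from itertools import product


def original_eec(b1, b2, mt3_done):
    """Original form: (control part) AND (data part).

    b1: branch taken by MT1 (None, "MT2" or "MT3")
    b2: branch taken by MT2 (None, "MT4" or "other")
    mt3_done: True once MT3 has completed
    """
    control = (b1 == "MT3") or (b2 == "MT4")
    data = mt3_done or (b1 == "MT2")   # MT3 done, or MT3 will not run
    return control and data


def simplified_eec(b1, b2, mt3_done):
    """Simplest form: MT3 completes OR MT2 branches toward MT4."""
    return mt3_done or (b2 == "MT4")


def feasible(b1, b2, mt3_done):
    """Keep only states that can occur under the assumed control flow."""
    if b2 is not None and b1 != "MT2":
        return False                   # MT2 runs only after MT1 -> MT2
    if mt3_done and b1 != "MT3":
        return False                   # MT3 runs only after MT1 -> MT3
    return True


states = [s for s in product([None, "MT2", "MT3"],
                             [None, "MT4", "other"],
                             [False, True]) if feasible(*s)]
forms_agree = all(original_eec(*s) == simplified_eec(*s) for s in states)
```

Under these assumptions the two forms agree on every reachable state, which is the sense in which the extra clauses are redundant.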

The Earliest Executable Condition of each macro-task is represented in a
macro-task graph as shown in Fig. 1(b).

In the macro-task graph, nodes represent macro-tasks. A small circle inside
a node represents a conditional branch. Solid edges represent data dependencies.
Dotted edges represent extended control dependencies. An extended control
dependency covers both an ordinary control dependency and the condition under
which a data dependence predecessor of MTi is not executed.

Solid and dotted arcs connecting solid and dotted edges have two different
meanings. A solid arc indicates that the edges it connects are in an AND
relationship. A dotted arc indicates that the edges it connects are in an OR
relationship.

In the macro-task graph (MTG), arrows are omitted on the assumption that edges
point downward; an edge drawn with an arrow represents an original control flow
edge, or branch direction, in the macro-flow graph.

2.4 Generation of Scheduling Routine 

In coarse grain task parallel processing, dynamic scheduling and static
scheduling are used to assign macro-tasks to processor clusters or
processor elements. In dynamic scheduling, MTs are assigned to processor
clusters or processor elements at runtime to cope with runtime uncertainties
such as conditional branches. The dynamic scheduling routine is generated and
embedded into the user program by the compiler to eliminate the overhead of OS
calls for thread scheduling.

Though generally dynamic scheduling overhead is large, in OSCAR compiler 
the dynamic scheduling overhead is relatively small since it is used for the coarse 
grain tasks assignment. There are two kinds of schemes for dynamic scheduling, 
namely, a centralized dynamic scheduling, in which the scheduling routine is 
executed by a processor element, and a distributed scheduling, in which the 
scheduling routine is distributed to all processors. 

Also, in static scheduling, the assignment of macro-tasks to processor clusters
or processor elements is determined at compile time inside the automatic
parallelizing compiler if the macro-task graph has only data dependency edges.
Static scheduling is useful since it allows us to minimize data transfer and
synchronization overhead without run-time scheduling overhead.



3 Implementation of Coarse Grain Task Parallel 
Processing Using OpenMP 

This section describes an implementation method of the coarse grain task parallel 
processing using OpenMP for SMP machines. 

Though macro-tasks are assigned to processor clusters or processor elements
in the coarse grain task parallel processing of OSCAR compiler, OpenMP only
supports thread level parallel processing. Therefore, coarse grain parallel
processing is realized by mapping a thread to a processor element and a
thread group to a processor cluster.

Though OpenMP is used as a method of the thread generation in this im- 
plementation because of its high portability, the proposed scheme can be used 
with other thread creation methods as well. 

3.1 Generation of Threads

In the proposed coarse grain task parallel processing using OpenMP, threads are 
generated by PARALLEL SECTIONS directive only once at the beginning of 
the execution of program. 

Generally, upper level threads fork nested threads to realize nested or hier- 
archical parallel processing. 

However, the proposed scheme realizes this hierarchical parallel processing 
with single level thread generation by writing all hierarchical behavior, or by 
embedding hierarchical scheduling routines, in each section between PARAL- 
LEL SECTIONS and END PARALLEL SECTIONS. This scheme allows us to 
minimize thread fork and join overhead and to implement hierarchical coarse 
grain parallel processing without any language extension. 

3.2 Macro-task Scheduling 

This section describes the code generation scheme that uses static and dynamic
scheduling to assign macro-tasks to threads or thread groups hierarchically.

In the coarse grain task parallel processing by OSCAR compiler, macro-tasks
are assigned to threads or thread groups at run time and/or at compile time.
OSCAR compiler can choose the centralized dynamic scheduling scheme and/or
the distributed dynamic scheduling scheme in addition to static scheduling. These
scheduling methods are used, in any combination, according to the parallelism of
the source program, the number of processors, and the data transfer and
synchronization overhead of the target multiprocessor system. In centralized
dynamic scheduling, the scheduling code is assigned to a single thread. In
distributed dynamic scheduling, the scheduling code is placed before and after
each task, assuming exclusive access to the scheduling tables. More concretely, the
compiler chooses static scheduling for a macro-task graph with only data dependence
edges and dynamic scheduling for a macro-task graph with control dependence
edges in each layer, or nest level. Also, a centralized scheduler is usually chosen for
a processor, or a thread group, and a distributed scheduler is chosen for a processor
cluster with low mutual exclusion overhead to shared scheduling information.
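
The per-layer choice between static and dynamic scheduling can be sketched as follows. This is a minimal illustration, assuming each macro-task graph is given as a list of `(source, destination, kind)` edges; the function name and edge encoding are hypothetical, and the further choice between centralized and distributed dynamic scheduling (which depends on mutual exclusion overhead) is not modeled.

```python
def choose_scheduling(edges):
    """Return "static" for a macro-task graph with only data dependence
    edges and "dynamic" as soon as a control dependence edge is present.

    edges: iterable of (src, dst, kind), kind being "data" or "control".
    """
    kinds = {kind for _, _, kind in edges}
    unknown = kinds - {"data", "control"}
    if unknown:
        raise ValueError(f"unknown edge kinds: {sorted(unknown)}")
    return "dynamic" if "control" in kinds else "static"


# Assumed example: one graph per layer (nest level) of a program.
layers = {
    "layer1": [("MT1", "MT2", "data"), ("MT2", "MT3", "data")],
    "layer2": [("MT3_1", "MT3_2", "control"), ("MT3_1", "MT3_3", "data")],
}
decisions = {name: choose_scheduling(edges) for name, edges in layers.items()}
```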

Those scheduling methods can be hierarchically combined freely depending
on program parallelism, the number of processors available for the program layer,
synchronization overhead and so on.

Centralized Dynamic Scheduling. In the centralized scheduling scheme, one
thread in a parallel processing layer that uses centralized scheduling serves as
the centralized scheduler, which assigns macro-tasks to thread groups, namely,
processor clusters or processor elements. This thread is called the scheduler or
Centralized Scheduler (CS).

The behavior of the CS, written in an OpenMP “SECTION”, is as follows.

step1 Receive a completion or branch signal from each macro-task.

step2 Check the Earliest Executable Condition, and enqueue ready macro-tasks,
which satisfy this condition, to a ready task queue.

step3 Find a processor cluster, or a thread group, to which a ready macro-task
should be assigned according to the priority of the Dynamic CP method.

step4 Assign the macro-task to the processor cluster or the processor element. If
the assigned macro-task is “End MT” (EMT), the centralized scheduler finishes
the scheduling routine in the layer.

step5 Jump to step1.

In the centralized scheduling scheme, ready macro-tasks are initially assigned
to thread groups. Each thread group executes these macro-tasks at the
beginning. When a macro-task finishes execution or determines its branch
direction, it sends a signal to the centralized scheduler. Therefore, the
centralized scheduler busy-waits for these signals at the start.

When the centralized scheduler receives these signals, it searches for new
executable, or ready, macro-tasks by checking the Earliest Executable Condition
for each macro-task.

If a ready macro-task is found, the centralized scheduler finds a thread group,
or a thread, to which the macro-task should be assigned. The centralized scheduler
assigns the macro-task and goes back to the signal waiting routine.

On the other hand, slave threads execute macro-tasks assigned by the
centralized scheduler. Their behavior is summarized in the following.

step1 Wait for macro-task assignment information from the centralized scheduler.
step2 Execute the assigned macro-task.
step3 Go back to step1.
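
The interplay between the centralized scheduler and the slave threads can be sketched with ordinary threads and queues. This is not the code OSCAR generates (which is Fortran with OpenMP directives); it is a simplified model that assumes only data dependences, uses blocking queues where the text describes busy-waiting, and assigns ready macro-tasks in FIFO order instead of Dynamic CP priority. The names `run_centralized` and `END_MT` are hypothetical.

```python
import queue
import threading

END_MT = "EMT"  # sentinel standing in for the "End MT" macro-task


def run_centralized(tasks, preds, n_workers):
    """tasks: name -> callable; preds: name -> set of predecessor names.
    Returns the order in which completions reached the scheduler."""
    signals = queue.Queue()                       # completion signals
    inboxes = [queue.Queue() for _ in range(n_workers)]
    completed_order = []

    def worker(wid):
        while True:
            name = inboxes[wid].get()             # step1: wait for assignment
            if name == END_MT:
                return                            # leave this layer
            tasks[name]()                         # step2: execute macro-task
            signals.put((wid, name))              # signal, then loop (step3)

    def scheduler():
        done, issued = set(), set()
        idle = list(range(n_workers))
        ready = []
        while len(done) < len(tasks):
            # step2: enqueue macro-tasks whose predecessors all completed
            for name in tasks:
                if (name not in issued and name not in ready
                        and preds.get(name, set()) <= done):
                    ready.append(name)
            # step3 and step4: hand ready macro-tasks to idle workers
            while ready and idle:
                name = ready.pop(0)
                issued.add(name)
                inboxes[idle.pop(0)].put(name)
            # step1: wait until some worker signals completion
            wid, name = signals.get()
            done.add(name)
            completed_order.append(name)
            idle.append(wid)
        for box in inboxes:                       # assign EMT to every worker
            box.put(END_MT)

    threads = [threading.Thread(target=worker, args=(w,))
               for w in range(n_workers)]
    threads.append(threading.Thread(target=scheduler))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return completed_order
```

The sketch assumes an acyclic dependence graph and at least one worker; branch signals and extended control dependences would need an additional signal type.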

So, at the beginning of its OpenMP “SECTION”, a slave thread executes
busy-wait code to wait for assignment information from the centralized
scheduler if a macro-task is not assigned initially.

In the dynamic scheduling mode, since every thread or thread group may
execute any macro-task, the whole code including all macro-tasks
is copied into each OpenMP “SECTION” for each slave thread.

Figure 2 shows an image of the generated OpenMP code for every thread. Sub
macro-tasks generated inside Macro-Task(MT)3, a macro-task in the 1st layer,
are represented as macro-tasks in the 2nd layer, MT3_1, MT3_2, and so on. The
2nd layer macro-tasks are executed on k threads, with one additional thread
serving as the centralized dynamic scheduler. Moreover, MTs like
MT3_2_1 and MT3_2_2 in the 3rd layer are generated inside MT3_2 in the 2nd
layer. Figure 2 exemplifies a code image in the case where the 3rd layer MTs like
MT3_2_1 and MT3_2_2 are dynamically scheduled to threads by the distributed
scheduler. The details of distributed scheduling are described later.

After the completion of the execution of a macro-task, the slave threads go 
back to the routine to wait for the assignment of a macro-task by a centralized 
scheduler. 

Also, the compiler generates a special macro-task called “End MT” (EMT) in
each layer. As shown in the 2nd layer in Fig. 2, the EMT is written at the end
of every OpenMP “SECTION”, and the CS assigns the EMT to all thread groups when
all of the threads executing the same hierarchy are finished. After assigning the
EMT to all threads, the CS finishes the scheduling routine in the layer. Each
thread group jumps out of its hierarchy. If the hierarchy is the top layer, the
program finishes execution. If there exists an upper layer, threads continue to
execute the upper layer macro-tasks.

[Figure: generated code image with one !$OMP SECTION per thread (Thread 1,
Thread 2, ..., Thread k+1). 1st layer: static scheduling. 2nd layer: centralized
dynamic scheduling. 3rd layer: distributed dynamic scheduling, with the
distributed dynamic scheduler embedded in the 3rd layer code.]

Fig. 2. Code image

In the example shown in Fig. 2, thread 1 finishes execution in the 2nd layer
by “goto 70” after “End MT” and jumps out to the statement “70 continue” in the
upper layer.

Distributed Dynamic Scheduling. Each thread group or processor cluster 
schedules a macro-task to itself and executes the macro-task in the layer where 
the distributed dynamic scheduler is chosen. 

In the distributed scheduling scheme, all shared data for scheduling are as- 
signed onto shared memory and accessed exclusively. 

step1 Search for executable, or ready, macro-tasks that satisfy the Earliest
Executable Condition as a result of the completion or a branch of a macro-task,
and enqueue the ready macro-tasks to the ready queue with exclusive access to
the data on shared memory required for dynamic scheduling.
step2 Choose the macro-task that the thread should execute next, considering
the priority of the Dynamic CP algorithm.
step3 Execute the macro-task.
step4 Update the Earliest Executable Condition table exclusively.
step5 Go back to step1.
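
A rough sketch of this self-scheduling loop is given below. It is a simplified model, not the generated OpenMP code: only data dependences are handled, a lock stands in for exclusive access to the shared scheduling tables, the first ready macro-task is taken instead of applying Dynamic CP priority, and a short sleep replaces pure busy-waiting. The name `run_distributed` is hypothetical.

```python
import threading
import time


def run_distributed(tasks, preds, n_threads):
    """tasks: name -> callable; preds: name -> set of predecessor names.
    Every thread runs the same scheduling code around each macro-task."""
    lock = threading.Lock()          # exclusive access to shared tables
    done, taken = set(), set()
    completed_order = []

    def thread_body():
        while True:
            with lock:
                if len(done) == len(tasks):
                    return                       # layer finished
                # step1: find ready macro-tasks not yet claimed
                ready = [t for t in tasks
                         if t not in taken and preds.get(t, set()) <= done]
                # step2: choose one (first found; priority not modeled)
                name = ready[0] if ready else None
                if name is not None:
                    taken.add(name)
            if name is None:
                time.sleep(0.001)                # nothing ready yet; retry
                continue
            tasks[name]()                        # step3: execute
            with lock:                           # step4: update table
                done.add(name)
                completed_order.append(name)
            # step5: loop back to step1

    threads = [threading.Thread(target=thread_body) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return completed_order
```

Compared with the centralized scheme, no thread is reserved for scheduling, at the price of contention on the shared tables.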

For example, the 3rd layer shown in Fig. 2 uses the distributed dynamic
scheduling scheme. In this example, distributed scheduling is applied
inside MT3_2 in the second layer, and two thread groups, each consisting of one
thread, are realized in this layer simply by executing the thread code generated
by the compiler.

Static Scheduling Scheme. If a macro-task graph in a target hierarchy has
only data dependencies, static scheduling is applied to reduce data transfer,
synchronization and scheduling overheads.

In static scheduling, the assignment of macro-tasks to processor clusters
or processor elements is determined at compile time. Therefore, each OpenMP
“SECTION” needs only the macro-tasks assigned to it, placed in the order
predetermined by the static scheduling algorithms CP/DT/MISF, DT/CP and
ETF/CP. In other words, the compiler generates a different program for each
thread, as shown in the first layer of Fig. 2. When static scheduling is used,
it is assumed that each thread is bound to a processor.

In the OSCAR compiler, all of those algorithms are applied to the same 
macro-task graph, and the best schedule is automatically chosen. 
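
The idea of trying several heuristics on the same macro-task graph and keeping the best schedule can be sketched as follows. The actual CP/DT/MISF, DT/CP and ETF/CP algorithms are not reproduced here; as stand-ins, this sketch uses a plain list scheduler with two assumed priority rules (critical-path length and task cost) and ignores data transfer cost. All names are hypothetical.

```python
def critical_path_lengths(cost, succs):
    """Longest cost-weighted path from each task to an exit task."""
    memo = {}

    def longest(t):
        if t not in memo:
            memo[t] = cost[t] + max(
                (longest(s) for s in succs.get(t, ())), default=0)
        return memo[t]

    return {t: longest(t) for t in cost}


def list_schedule(cost, succs, n_procs, priority):
    """Greedy list scheduling: repeatedly place the ready task with the
    highest priority on the processor that becomes free first."""
    preds = {t: set() for t in cost}
    for t, ss in succs.items():
        for s in ss:
            preds[s].add(t)
    finish, assignment = {}, {}
    proc_free = [0] * n_procs
    remaining = set(cost)
    while remaining:
        ready = [t for t in remaining if all(p in finish for p in preds[t])]
        task = max(ready, key=lambda t: (priority[t], t))
        proc = min(range(n_procs), key=lambda i: proc_free[i])
        start = max([proc_free[proc]] + [finish[p] for p in preds[task]])
        finish[task] = start + cost[task]
        proc_free[proc] = finish[task]
        assignment[task] = proc
        remaining.remove(task)
    return max(finish.values()), assignment


def best_schedule(cost, succs, n_procs):
    """Apply every candidate priority rule and keep the shortest schedule."""
    candidates = {
        "critical_path": critical_path_lengths(cost, succs),
        "task_cost": dict(cost),
    }
    results = {name: list_schedule(cost, succs, n_procs, prio)
               for name, prio in candidates.items()}
    best = min(results, key=lambda name: results[name][0])
    return best, results[best]


# Assumed example graph: A precedes B and C, which both precede D.
cost = {"A": 2, "B": 3, "C": 1, "D": 2}
succs = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}
chosen, (makespan, assignment) = best_schedule(cost, succs, 2)
```

The resulting assignment is what would be written, task by task, into each thread's own section of the generated program.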

At runtime, each thread group needs to synchronize and transfer shared data 
among other thread groups in the same hierarchy to satisfy the data dependency 
among macro-tasks. 

To realize the data transfer and synchronization, the following code genera- 
tion scheme is used. 

If a macro-task assigned to the thread group is a Basic Block, only one thread
in the thread group executes the Basic Block (BB) or Basic Pseudo Assignment
(BPA). Therefore, the code for MT1 (BPA) is written in the OpenMP “SECTION”
for one thread, as shown in the first layer in Fig. 2. The OpenMP “SECTION”s
for the other threads do not have the code for MT1 (BPA).

A parallel loop, for example RB2, is decomposed into smaller loops like
MT2_1 to MT2_(k+1) at compile time, and these are defined as independent MTs
as shown in Fig. 2. Each thread has the code for its assigned RBs, or decomposed
partial parallel loops. In this case, since the partial loops are different
macro-tasks, barrier synchronization at the end of the original parallel loop is
not required.

If a macro-task is a sequential loop to which data localization cannot be
applied, only one thread or thread group executes it. If the sequential loop, or
RB, assigned to a processor cluster has large execution cost, its body is
hierarchically decomposed into coarse grain tasks. In Fig. 2, MT3 is a sequential
loop and is decomposed into MT3_1, MT3_2, and so on.

Fig. 3. Overview of OSCAR Fortran Compiler



4 Performance Evaluation 

This section describes the optimization for exploiting coarse grain task paral- 
lelization by OSCAR Fortran Compiler and its performance for several programs 
in Perfect benchmarks and SPEC 95fp benchmarks on IBM RS6000 SP 604e High 
Node 8 processor SMP. 

4.1 OSCAR Fortran Compiler 

Figure 3 shows the overview of OSCAR Fortran Compiler. It consists of a Front
End, Middle Path and Back Ends. OSCAR Fortran Compiler has various Back
Ends for different target multiprocessor systems such as the OSCAR
distributed/shared memory multiprocessor system, Fujitsu’s VPP supercomputer,
UltraSparc, PowerPC, MPI-2 and OpenMP. The OpenMP Back End is used in this
paper; it generates parallelized Fortran source code with OpenMP directives. In
other words, OSCAR Fortran Compiler is used as a preprocessor that transforms
an ordinary sequential Fortran program into an OpenMP Fortran program for SMP
machines.

4.2 Evaluated Programs 

The programs used for performance evaluation are ARC2D in Perfect
Benchmarks, and SWIM, TOMCATV, HYDRO2D and MGRID in SPEC 95fp Benchmarks.
ARC2D is an implicit finite difference code for analyzing fluid flow problems and
solves Euler equations. SWIM solves the system of shallow water equations
using finite difference approximations. TOMCATV is a vectorized mesh generation
program. HYDRO2D is a vectorizable Fortran program with double precision
floating-point arithmetic. MGRID is a multi-grid solver in a 3D potential field.

Fig. 4. Macro-flow graph of subroutine STEPFX in ARC2D

As an example of the exploitation of coarse grain parallelism, parallelization 
of ARC2D is briefly explained. ARC2D has about 4500 statements including 40 
subroutines. More than 90% of the execution time is spent in subroutine IN- 
TEGR. In the subroutine INTEGR, subroutines FILERX, FILERY, STEPFX, 
STEPFY consume 62% of the total execution time. Here, as an example, sub- 
routine STEPFX is treated since it consumes 30% of the total execution time. 

Macro Flow Graph of subroutine STEPFX is shown in Fig. 4. The first layer
in Fig. 4 consists of three macro-tasks. Macro-Task(MT) 2 in the first layer is a
sequential loop having three iterations with large processing time.

To minimize the processing time of MT2, the second layer macro-tasks are 
defined for a loop body of MT2, such as twelve Basic Blocks (MT1~6, 8, 9, 11, 
14, 15, 18), four DOALL loops (MT7, 10, 12, 13) and four Subroutine Blocks 
(MT16, 17, 19, 20). 

In the second layer of Fig. 4, MT groups, namely, MTs 1 and 2, MTs 3 and
4, MTs 5 and 6, MTs 15, 16 and 17 and MTs 18, 19 and 20, are executed
depending on the value of loop index N. For example, MT2 is executed as the
result of the conditional branch inside MT1 (a Basic Block) when the value of loop
index N is 2. MT4 is executed when N=3. MT6 is executed when N=4. As the
result of the conditional branch inside MT14 depending on the loop index, MT15, 16
and 17 are executed when N=2, and MT18, 19 and 20 are executed when N=3
and 4. Also, MT15 and 18 in the second layer of Fig. 4 are Basic Blocks having
conditional branches depending on input variables from files. As the result
of the conditional branches inside MT15 and 18, either MTs 16 and 19 or MTs 17
and 20 are executed.

Fig. 5. Optimized macro-task graph of subroutine STEPFX in ARC2D

At first, loop unrolling is applied to MT2, which has 3 iterations, in the first
layer of Fig. 4. As the result of loop unrolling, OSCAR compiler can remove the
conditional branches inside MT1, 3, 5 and 14 in the second layer of Fig. 4 that
depend on loop index N. By loop unrolling, SB16, 17, 19 and 20 in the second layer
are transformed to SB8, 9, 17, 18, 26 and 27 in Fig. 5. Either the macro-task group
composed of SBs 8, 17 and 26 or the group composed of SBs 9, 18 and 27 in Fig. 5
is executed, according to the conditional branches inside MT7, 16 and 25. As the
result of interprocedural analysis of these subroutine blocks, the subroutine
blocks inside each group can be processed in parallel. Figure 5 shows the
optimized Macro-Task Graph.

OSCAR compiler applies similar optimizations to other subroutines called
from subroutine INTEGR. In addition, inline expansion is applied to the
subroutine calls in subroutine INTEGR to extract more coarse grain
parallelism.

Figure 6 shows the optimized Macro-Task Graph for the subroutine INTEGR.

Fig. 6. Optimized macro-task graph of subroutine INTEGR in ARC2D.



4.3 Architecture of IBM RS6000 SP 

RS6000 SP 604e High Node used for the evaluation is an SMP server having eight
PowerPC 604e (200 MHz) processors. Each processor has 32 KB L1 instruction and
data caches and a 1 MB L2 unified cache. The shared main memory is 1 GB.

4.4 Performance on RS6000 SP 604e High Node

In this evaluation, a coarse grain parallelized program automatically generated
by OSCAR compiler is compiled by IBM XL Fortran compiler version 5.1
and executed on 1 through 8 processors of RS6000 SP 604e High Node. The
performance of OSCAR compiler with XL Fortran compiler is compared with
the native IBM XL automatic parallelizing Fortran compiler. In the compilation
by XL Fortran, the maximum optimization option “-qsmp=auto -O3 -qmaxmem=-1
-qhot” is used.

Figure 7(a) shows speed-up ratios for ARC2D by the proposed coarse grain
task parallelization scheme in OSCAR compiler and by the automatic loop
parallelization of XL Fortran compiler. The sequential processing time for ARC2D
was 77.5s and the parallel processing time by XL Fortran version 5.1 compiler
using 8 processors was 60.1s. On the other hand, the execution time of coarse
grain parallel processing using 8 processors by OSCAR Fortran compiler combined
with XL Fortran compiler was 23.3s. In other words, OSCAR compiler gave
us 3.3 times speed-up against the sequential processing time and 2.6 times
speed-up against native XL Fortran compiler for 8 processors. The loops
consuming large execution time in ARC2D have a small number of iterations.
Therefore, loop level parallelism alone cannot attain a large
performance improvement even if more than 4 or 5 processors are used. The
performance difference between OSCAR compiler and XL Fortran compiler in
Fig. 7(a) comes from the coarse grain parallelism detected by OSCAR compiler.

Next, Fig. 7(b) shows the speed-up ratio for SWIM. The sequential execution time
of SWIM was 551s. While automatic loop parallel processing using 8
processors by XL Fortran needed 112.7s and attained 4.9 times speed-up,
coarse grain task parallel processing by OSCAR Fortran compiler required only
61.1s and gave us 9.0 times speed-up against the sequential execution time,
through the effective use of distributed caches, and 1.8 times speed-up compared
with XL Fortran compiler.

Figure 7(c) shows the speed-up ratio for TOMCATV. The sequential execution
time of TOMCATV was 691s. The parallel processing time using 8 processors
by XL Fortran was 484s, a 1.4 times speed-up against the sequential execution
time. On the other hand, coarse grain parallel processing using 8 processors
by OSCAR Fortran compiler took 154s and gave us 4.5 times speed-up against the
sequential execution time. OSCAR Fortran compiler also gave us 3.1 times
speed-up compared with XL Fortran compiler using 8 processors.

Figure 7(d) shows the speed-up for HYDRO2D. The sequential execution time
of HYDRO2D was 1036s. While XL Fortran gave us 4.7 times speed-up (221s)
using 8 processors compared with the sequential execution time, OSCAR Fortran
compiler gave us 8.1 times speed-up (128s) compared with the sequential execution
time.

Finally, Fig. 7(e) shows the speed-up ratio for MGRID. The sequential execution
time of MGRID was 658s. For this application, XL Fortran compiler attained
4.2 times speed-up, or a processing time of 157s, using 8 processors. OSCAR
compiler achieved 6.8 times speed-up, or 97.4s. Namely, OSCAR Fortran
compiler gave us 1.6 times speed-up compared with XL Fortran compiler for 8
processors.

Fig. 7. Speed-up of several benchmarks on RS6000
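
The speed-up figures quoted in this section follow directly from the reported execution times. The short sketch below recomputes them as a consistency check; the table layout and function name are illustrative, and the SWIM sequential time is taken as 551s, the value consistent with the reported 4.9 and 9.0 times speed-ups.

```python
# (sequential, XL Fortran on 8 processors, OSCAR + XL on 8 processors),
# all in seconds, as reported in the text.
times = {
    "ARC2D":   (77.5, 60.1, 23.3),
    "SWIM":    (551.0, 112.7, 61.1),
    "TOMCATV": (691.0, 484.0, 154.0),
    "HYDRO2D": (1036.0, 221.0, 128.0),
    "MGRID":   (658.0, 157.0, 97.4),
}


def speedups(seq, xl, oscar):
    """Speed-up ratios rounded to one decimal place."""
    return {
        "xl_vs_seq": round(seq / xl, 1),
        "oscar_vs_seq": round(seq / oscar, 1),
        "oscar_vs_xl": round(xl / oscar, 1),
    }


table = {name: speedups(*t) for name, t in times.items()}
for name, row in table.items():
    print(f"{name:8s} XL {row['xl_vs_seq']:4.1f}x  "
          f"OSCAR {row['oscar_vs_seq']:4.1f}x  "
          f"OSCAR/XL {row['oscar_vs_xl']:4.1f}x")
```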



5 Conclusions 

This paper has presented the implementation scheme of the automatic coarse 
grain task parallel processing using OpenMP API as an example of realization 
and its performance on an off the shelf SMP machine. 







OSCAR compiler generates coarse grain parallelized code which forks threads
only once at the beginning of a program and joins them only once at the end to
minimize overhead, even though hierarchical coarse grain task parallelism is
automatically exploited.

In the performance evaluation, OSCAR compiler with XL Fortran compiler
gave us scalable speed-up for application programs in Perfect and SPEC 95fp
benchmarks and significant speed-up compared with native XL Fortran compiler,
namely 2.6 times for ARC2D, 1.8 times for SWIM, 3.1 times for TOMCATV,
1.7 times for HYDRO2D and 1.6 times for MGRID when 8 processors are
used. In other words, OSCAR Fortran compiler can easily boost the performance of
XL Fortran compiler, which is one of the best commercially available loop
parallelizing compilers for IBM RS6000 SP 604e High Node, by using coarse grain
parallelism with low overhead.

Currently, the authors are planning to evaluate the proposed coarse grain task
parallel processing scheme on other SMP machines using OpenMP and to improve
the implementation scheme to further reduce dynamic scheduling overhead
and data transfer overhead, using a data localization scheme to use distributed
caches efficiently.

1 Introduction

Task graphs and their equivalents have proved to be a valuable abstraction for
representing the execution of parallel programs in a number of different
applications. Perhaps the most widespread use of task graphs has been for
performance modeling of parallel programs, including quantitative analytical
models, theoretical and abstract analytical models, and program simulation. A
second important use of task graphs is in parallel programming systems. Parallel
programming environments such as PYRROS, CODE, HENCE [25] and Jade have used
task graphs at three different levels: as a programming notation for expressing
parallelism, as an internal representation in the compiler for computation
partitioning and communication generation, and as a runtime representation for
scheduling and execution of parallel programs. Although the task graphs used in
these systems differ in representation and semantics (e.g., whether task graph
edges capture purely precedence constraints or also dataflow requirements), there
are close similarities. Perhaps most importantly, they all capture the parallel
structure of a program separately from the sequential computations, by breaking
down the program into computational “tasks”, precedence relations between tasks,
and (in some cases) explicit communication or synchronization operations between
tasks.

If task graph representations could be constructed automatically, via com- 
piler support, for common parallel programming standards such as Message- 
Passing Interface (MPI), High Performance Fortran (HPF), and OpenMP, the 
techniques and systems described above would become available to a much wider 
range of programs than they are currently. Within the context of the POEMS 
project, we have developed a task graph based application representation
that is used to support modeling of the end-to-end performance characteristics 
of a large-scale parallel application on a large parallel system, using a combina- 
tion of analytical, simulation and hybrid models, and models at multiple levels 
of resolution for individual components. This paper describes how parallelizing 
compiler technology can be used to automate the process of constructing this 
task graph representation for HPF programs compiled to MPI (and, in the near 

S.P. Midkiff et al. (Eds.): LCPC 2000, LNCS 2017, pp. 208-|37g 2001. 

(c) Springer-Verlag Berlin Heidelberg 2001 
future, for existing MPI programs directly). In particular, this paper makes the
following contributions: 

— We describe compiler techniques to derive a static, symbolic task graph rep- 
resenting the MPI code generated for a given HPF program. A key aspect 
of this process is the use of symbolic integer sets and mappings to capture a 
number of dynamic task instances or edge instances as a single node or edge 
at compile time. These techniques allow the compiler to describe sophis- 
ticated computation partitionings and communication and synchronization 
patterns in symbolic terms. 

— We describe how standard analysis techniques can be used to condense the 
task graph and simplify control flow, whenever less than fully detailed infor- 
mation suffices (as in many performance modeling applications in practice). 

— Finally, we describe an approach to instantiate a dynamic task graph repre- 
sentation from the static task graph, based on a novel use of code generation 
from symbolic integer sets. 
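
As a rough illustration of how a single symbolic node or edge can stand for many dynamic instances, the sketch below instantiates one integer set and one integer mapping for a concrete processor count. It is not the dHPF implementation: the example (a task run on all P processors, and a SEND on processor p paired with a RECV on its right neighbor p + 1) and the function names are assumptions chosen for illustration, and plain Python enumeration stands in for code generation from symbolic sets.

```python
def node_instances(P):
    """Instances of a task executed by all P processors:
    the set {[p] : 0 <= p <= P-1}."""
    return [p for p in range(P)]


def edge_instances(P):
    """Instances of a SEND-to-RECV edge to the right neighbor:
    the mapping {[p] -> [q] : q = p+1 and 0 <= p < P-1}."""
    return [(p, p + 1) for p in range(P - 1)]


def instantiate(P):
    """Expand the symbolic node and edge into a small dynamic task graph
    for a specific processor configuration."""
    nodes = [("SEND", p) for p in node_instances(P)]
    nodes += [("RECV", p) for p in node_instances(P)]
    edges = [(("SEND", p), ("RECV", q)) for p, q in edge_instances(P)]
    return nodes, edges


nodes, edges = instantiate(4)
```

The symbolic description stays the same size however large P becomes; only the instantiated graph grows.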

In addition to the above techniques, which to our knowledge are new, the 
compiler also uses standard techniques to compute symbolic scaling functions 
for task computation times and message communication volumes. 

The techniques described above have been implemented in the Rice dHPF
compiler system, which compiles HPF programs to MPI for message-passing
systems using aggressive techniques for computation partitioning and
communication optimization. This implementation was recently used in a joint
project with the parallel simulation group at UCLA to improve the scalability of
simulation of message passing programs. In that work, we showed how
compiler information captured in the task graph can be used to reduce the memory
and time requirements for simulating message-passing programs in detail. In the
context of the present paper, these results illustrate the potential importance of
automatically constructing task graphs for widely used programming standards.

The next section briefly describes the key features of our static and
dynamic task graph representations. Section 3 is the major technical section, which
presents the compiler techniques described above to construct the task graph
representations. Section 4 provides some results about the structure of the
compiler-generated task graphs for simple programs and illustrates how task graphs
have been used to improve the scalability of simulation, as mentioned above. We
conclude with a brief overview of related work (Section 5) and a discussion of
future plans (Section 6).

2 Background: The Task Graph Representation 

The POEMS project aims to create a performance modeling environment
for the end-to-end modeling of large parallel applications on complex parallel
and distributed systems. The wide range of modeling techniques supported by
POEMS, and the goal of integrating multiple modeling paradigms, make it
challenging, if not impossible, for the end-user to generate the required workload
information manually for a large-scale application. Thus, since the conception
of the project, it has been deemed essential to use compiler support to simplify
and partially automate the process of constructing the workload information. To
achieve this, we have designed a common task graph based program
representation that provides a uniform platform for capturing the parallel
structure of a program as well as its associated workloads for different modeling
techniques. This representation uses two flavors of a task graph, the static task
graph and the dynamic task graph. Its design is described in detail elsewhere and
is briefly summarized here. The specific information we aim to collect for a given
program
includes: (1) The detailed computation partitioning and communication struc- 
ture of the program, described in symbolic terms. (2) Source code for individual 
tasks to support source-code-driven uses such as detailed program-driven sim- 
ulation of memory hierarchy performance. (3) Scaling functions that describe 
how computation and communication scale as a function of program inputs and 
processor configuration. (4) Optionally, the detailed dynamic behavior of the 
parallel program, for a specified program input and processor configuration. 

The Static Task Graph: The static task graph (STG) captures the static parallel 
structure of a program and is defined only by the program per se. Thus, it is in- 
dependent of runtime input values, intermediate program results, and processor 
configuration. Each node (or task) of the graph may represent one of the follow- 
ing main types: control flow statements such as loops and branches, procedure 
calls, communication, or pure computation. Edges between nodes may denote 
control flow within a processor or synchronization between different processors 
(due to communication tasks). For example, the STG for a simple parallel program
is shown in Figure 1 and is explained in more detail in the next section.

A key aspect of the STG is that each node represents a set of instances 
of the task, one per processor that executes the task at runtime. Similarly, an 
edge in the STG actually represents a set of edge instances connecting pairs 
of dynamic node instances. We use symbolic integer sets to describe the set
of instances for a given node, e.g., a task executed by P processors would be
described by the set $\{[p] : 0 \le p \le P-1\}$, and symbolic integer mappings
to describe the edge instances, e.g., an edge from a SEND task on processor
p to a RECV task on processor q = p + 1 (i.e., each processor sends data to
its right neighbor, if any) would be described by the mapping
$\{[p] \to [q] : q = p+1 \wedge 0 \le p < P-1\}$. This kind of mapping enables precise symbolic
representations of arbitrary regular communication patterns. Irregular patterns 
(i.e., data-dependent patterns that cannot be determined until runtime) have 
to be represented as an all-to-all communication, which is the best that can be 
done statically. 
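
To make the set and mapping notation concrete, the following minimal Python sketch enumerates the node instances and the right-neighbor edge instances for a fixed value of P. The function names are illustrative only and are not part of the compiler described here; the compiler keeps these sets in symbolic form rather than as explicit lists.

```python
def task_instances(P):
    """Enumerate the set {[p] : 0 <= p <= P-1}: one task instance per processor."""
    return list(range(P))


def shift_right_edges(P):
    """Enumerate the mapping {[p] -> [q] : q = p+1 and 0 <= p < P-1}.

    Each pair (p, q) is one SEND -> RECV edge instance. The last processor
    has no right neighbor, so it contributes no edge.
    """
    return [(p, p + 1) for p in range(P - 1)]


if __name__ == "__main__":
    print(task_instances(4))
    print(shift_right_edges(4))
```

With P = 4 the node has four instances and the edge has three, one for each processor that has a right neighbor.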

To capture high-level communication patterns where possible (e.g., shift,
pipeline, broadcast, etc.), we group the communication operations in the
program into related groups, each describing a single “logical communication 
event”. A communication event descriptor, kept separate from the STG, cap- 
tures all information about a single communication event. This includes the 
communication pattern, the set of communication tasks involved, and a symbolic
expression for the communication size. The CPU components of each communication
event are represented explicitly as communication tasks in the STG,
allowing us to use task graph edges between these tasks to explicitly capture the 
synchronization implied by the underlying communication calls. The number of 
communication nodes and edges depends on the communication pattern and also 
on the type of message passing calls used. This technique does not work for MPI 
receive operations that use a wildcard message tag (because the matching send 
cannot be easily identified). It does work for receive operations that use a wildcard
for the sending processor, but the symbolic mapping on the communication
edges may be an all-to-all mapping (for the processors that execute the send and 
receive statements). Making the communication tasks explicit in the STG has 
proved valuable also because it allows us to describe arbitrary interleavings (i.e., 
overlaps) of communication and computation tasks on individual processors and 
across processors. 

In addition to the symbolic sets and mappings above, each node and com- 
munication event in the STG includes a symbolic scaling function that describes 
how the task computation time or the message size scales as a function of pro- 
gram variables. Finally, note that the STG of a program containing multiple 
procedures is represented as a number of unconnected graphs, each corresponding
to a single procedure. Each call site is represented by a CALL task that
identifies the called procedure by name. 



The Dynamic Task Graph: The dynamic task graph (DTG) is a directed acyclic 
graph that captures the execution behavior of a program on a given input and 
given processor configuration. This representation is important for detailed per- 
formance modeling because it corresponds closely with the actual execution be- 
havior being modeled by a particular program performance model (whether using 
detailed simulation or abstract analytical models). 

The nodes of a dynamic task graph are computational tasks and individ- 
ual communication tasks. In particular, the DTG does not contain control flow 
nodes (loops, branches, jumps, and jump targets). It can be thought of as being 
instantiated from the static task graph by unrolling all the loops, resolving all 
the branches, and instantiating all the instances of parallel tasks, edges, and 
communication events. 

There are two approaches to making this representation tractable for large- 
scale programs, and these approaches can be combined: (1) we can condense tasks 
allocated to a process between synchronization points so that only (relatively) 
coarse-grain parallel tasks are explicitly represented, and (2) if necessary, we 
can compute the dynamic task graph “on the fly,” rather than precomputing 
it and storing it offline. We describe techniques to automatically condense the 
task graph in Section 3.2. The approach to instantiate the task graph on the fly
is outside the scope of this paper, but is a direct extension of the compile-time
instantiation of the DTG described in Section 3.3.

3 Compiler Techniques for Synthesizing the Task Graphs

As noted in the Introduction, there are three major aspects to synthesizing our 
task graph representation for a parallel program: (1) synthesizing the static, 
symbolic task graph; (2) condensing the task graph; and (3) optionally instanti- 
ating a dynamic task graph representing an execution on a particular program 
input. Each of these steps relies on information about the message-passing pro- 
gram gathered by the compiler, although for many programs the third step can 
be performed purely by inspecting the static task graph, as explained in Section 3.3.
These steps are described in detail in the following three subsections.

3.1 Synthesizing the Static Task Graph 

Four key steps need to be performed in synthesizing the static task graph (STG):
(1) generating computation and control-flow nodes; (2) generating communication
tasks for each logical communication event; (3) generating symbolic sets
describing the processors that execute each task, and symbolic mappings de- 
scribing the pattern of communication edges; and (4) eliminating excess control 
flow edges. 

Generating computation and control-flow nodes in the STG can be done in a 
single preorder traversal of the internal representation for each procedure; in our 
case, the representation is an Abstract Syntax Tree (AST). STG nodes are created
as appropriate statements are encountered in the AST. Thus, program statements
such as DO, IF, CALL, PROGRAM/FUNCTION/SUBROUTINE, and STOP/RETURN
each trigger the creation of a single node in the graph; encountering a DO
also leads to the creation of an ENDDO-NODE, and encountering an IF leads to the
creation of an ENDIF-NODE, a THEN-NODE, and an ELSE-NODE. Any contiguous sequence
of other computation statements that are executed by the same set of processors is
grouped into a single computational task (contiguous implies that they are not
interrupted by any of the above statements or by communication).
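
The grouping rule can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: statements arrive as a flat list of (kind, processor set) pairs in program order, processor sets are explicit `frozenset` values rather than symbolic sets, and the statement kinds (`"ASSIGN"`, `"COMM"`, and so on) are hypothetical labels chosen for the example.

```python
# Statement kinds that always get a node of their own (assumed labels).
CONTROL = {"DO", "ENDDO", "IF", "ENDIF", "CALL", "RETURN", "STOP"}


def build_tasks(statements):
    """Group statements into STG-like tasks.

    statements: list of (kind, procs), procs being a frozenset of processor ids.
    Returns a list of (node_type, [statement indices], procs).
    """
    tasks = []
    current = None  # the computational task currently being extended, if any
    for idx, (kind, procs) in enumerate(statements):
        if kind in CONTROL or kind == "COMM":
            current = None  # control flow and communication break contiguity
            tasks.append((kind, [idx], procs))
        elif current is not None and current[2] == procs:
            current[1].append(idx)  # same processor set: extend the open task
        else:
            current = ("COMP", [idx], procs)
            tasks.append(current)
    return tasks
```

A change in the processor set starts a new computational task even when no control-flow or communication statement intervenes, which is what allows the later edge-elimination step to expose parallelism between the two tasks.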

Identifying statements that are computed by the same set of processors is 
a critical aspect of the above step. This information is derived from the com- 
putation partitioning phase of the compiler and is translated into a symbolic 
integer set Q that is included with each task. By having a general representa- 
tion of the set of processors associated with each task, our representation can 
describe sophisticated computation partitioning strategies. The explicit set rep- 
resentation also enables us to check equality by direct set operations, in order 
to group statements into tasks. These processor sets are also essential for the
fourth step listed above, namely eliminating excess control-flow edges between 
tasks, so as to expose program parallelism. In particular, a control flow edge is 
retained between two tasks only if the intersection of their processor sets is not 
empty. Otherwise, the sink node is connected to its most immediate ancestor in 
the STG for which the result of this intersection is a non-empty set. 
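
The edge-elimination rule can be illustrated as follows. This is a minimal sketch, assuming a straight-line chain of tasks and explicit (non-symbolic) processor sets; the function name and task labels are hypothetical. It follows the rule exactly as stated, connecting each task to the single nearest predecessor whose processor set overlaps its own.

```python
def control_flow_edges(tasks):
    """tasks: list of (name, procs) in program order, procs being a set of ids.

    Returns a list of (source, sink) control-flow edges. A sink is attached to
    its closest predecessor whose processor set intersects the sink's set.
    """
    edges = []
    for i in range(1, len(tasks)):
        name, procs = tasks[i]
        for j in range(i - 1, -1, -1):
            if tasks[j][1] & procs:  # non-empty intersection
                edges.append((tasks[j][0], name))
                break
    return edges
```

For example, if T1 runs only on processor 0 and T2 only on processors 1 to 3, no edge joins T1 and T2, so the two tasks can proceed in parallel.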

When the first communication statement for a logical communication event 
is encountered, the communication event descriptor and all the communication 
tasks that are pertinent to this single event are built. The processor mappings 
for the synchronization between the tasks are also built at this time.

CHPF$ DISTRIBUTE A(*,BLOCK)
      DO I=2, N
        DO J=1, M-1
          A(I,J) = A(I-1,J+1)
        ENDDO
      ENDDO

(a) HPF source code fragment

      blk = block size per processor
      DO I=2, N
        IF (myid < P-1)
          irecv B(i, myid*blk+blk+1) from myid+1
        ! Execute local iterations of j-loop
        DO J=myid*blk+1, min(myid*blk+blk-1, M-1)
          A(I,J) = A(I-1,J+1)
        ENDDO
        IF (myid > 0) isend B(i, myid*blk+1) to myid-1
        IF (myid < P-1) wait-recv
        IF (myid > 0) wait-send
        ! Execute non-local iterations of j-loop
        J=myid*blk+blk
        IF (J <= M-1)
          A(I,J) = A(I-1,J+1)
      ENDDO

(b) Unoptimized MPI code generated by dHPF

(c) Static task graph

Fig. 1. An example of generating the communication tasks.



For an explicit message-passing program, the computation partitioning in- 
formation can be derived by analyzing the control-flow expressions that depend 
on process id variables. The communication pattern information has to be ex- 
tracted by recognizing the communication calls syntactically, analyzing their ar- 
guments, and identifying the matching send and receive calls. In principle, both 
the control-flow and the communication can be written in a manner that is too 
complex for the compiler to decipher, and some message passing programs will 
probably not be analyzable. But in most of the programs we have looked at, the 
control-flow idioms for partitioning the computation and the types of message 
passing operations that are used are fairly simple. We believe that the required 
analysis to construct the STG would be feasible with standard interprocedural 
symbolic analysis techniques.

To illustrate the communication information built by the compiler, consider 
the simple HPF example which, along with the MPI parallel code generated by
the dHPF compiler, is shown on the left-hand side of Figure 1. The parallelization
of the code requires the boundary values of array A along the J dimension to
be communicated inside the I loop. (In practice, the compiler pipelines the
communication in larger blocks by strip-mining the I loop, but that is omitted to
simplify the example.) The corresponding STG is shown on the right-hand side of
the figure. Solid lines represent control flow edges and dashed lines represent
interprocessor synchronization. In this example, the compiler uses the non-blocking
MPI communication primitives. The two dashed lines show that the wait-recv
operation cannot complete until the isend is executed, and the isend cannot
complete until the irecv is issued by the receiver (the latter is true because our
target MPI library uses sender-side buffering).^1 Also, the compiler interleaves
the communication tasks and computation so as to overlap waiting time at the
isend with the computation of local loop iterations, i.e., the iterations that do
not read or write any off-processor data. The use of explicit communication tasks
within the task graph allows this overlap to be captured precisely in the task
graph. The dashed edge between the isend and the wait-recv tasks is associated
with the processor mapping $\{[p_0] \to [q_0] : q_0 = p_0 - 1 \wedge 0 \le q_0 < P-1\}$,
denoting that each processor receives data from its “right” neighbor, except the
rightmost boundary processor. The other dashed edge has the inverse mapping,
i.e., $q_0 = p_0 + 1$.

Finally, the compiler constructs the symbolic scaling functions for each task 
and communication event, using direct symbolic analysis of loop bounds and 
message sizes. For a communication event, the scaling function is simply an 
expression for the message size. For each DO-NODE, the scaling function describes
the number of iterations executed by each processor, as a function of processor id
variables and other symbolic program variables. In the simple example above, the
scaling functions for the two DO nodes are N-1 and
min(myid*blk+blk-1, M-1) - (myid*blk+1) + 1, respectively. For a computational task,
the scaling function is a single parameter representing the workload corresponding
to the task. At this stage of the task graph construction, no further handling (such as
symbolic iteration counting for more complex loop bounds) takes place.
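
As a quick check of the two DO-node scaling functions quoted above, the following Python sketch evaluates them for concrete values. The function names are illustrative; the clamp to zero in the second function is an added assumption for processors whose local range is empty and does not appear in the expression given in the text.

```python
def outer_loop_iterations(N):
    """Trip count of DO I=2,N."""
    return N - 1


def local_j_iterations(myid, blk, M):
    """Trip count of the local J loop on processor myid:
    min(myid*blk+blk-1, M-1) - (myid*blk+1) + 1, clamped at zero (assumption).
    """
    upper = min(myid * blk + blk - 1, M - 1)
    lower = myid * blk + 1
    return max(upper - lower + 1, 0)


if __name__ == "__main__":
    # Hypothetical configuration: M = 16 columns, 4 processors, block size 4.
    for myid in range(4):
        print(myid, local_j_iterations(myid, blk=4, M=16))
```

With these values every processor executes three local iterations of the J loop per iteration of the I loop; the fourth column of each block is the non-local iteration handled after the communication completes.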

3.2 Condensing Nodes of the Static Task Graph 

The process described above produces a first-cut version of the STG. For many 
typical modeling studies of parallel program performance, however, a less de- 
tailed graph will be sufficient. For instance, a coarse-grain modeling approach 
could assume that all operations of a single process between two communication 
points constitute a single task. In order to add this functionality, the compiler 
traverses the STG and marks contiguous nodes, connected by a control-flow edge, 
that do not include communication. Such sequences of nodes are then collapsed 
and replaced in the STG by a single condensed task. Note that such a task will 
have a single point of entry and a single point of exit. For example, the two large
dotted rectangles in Figure 1(c) correspond to sequences of tasks that can be
collapsed into a single condensed task.

^1 More precisely, the isend task should be broken into two tasks, one that performs
local initialization and does not depend on the irecv, and a second one that can
only be initiated once the irecv has been issued but does not block the local
computation on the sending node. This would simply require introducing additional
communication tasks into the task graph.

      DO I=1, N
        RECV
        DO J=1, M
          S1
          IF (f)
            S2
          ENDIF
          S3
        ENDDO
        SEND
      ENDDO

(a) Source; (b) Initial STG; (c) Collapsed STG if f=G(I); (d) Collapsed STG if f=Y(J)

Fig. 2. Collapsing Branches of the Static Task Graph.

To preserve precision, the computation of the scaling function of the new 
condensed task is of particular importance. Ignoring conditional control flow for 
the moment, this scaling function is the symbolic sum of the scaling functions 
of the individual collapsed tasks, each multiplied by the symbolic number of 
iterations of the surrounding loops, where appropriate (only if these loops are 
also collapsed). 
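
The condensing step and the combined scaling function can be sketched as follows. This is a simplified illustration, not the dHPF implementation: the STG is assumed to be a linear chain of (kind, cost) pairs, branches are ignored, and each cost is a Python function of a dictionary of program variables standing in for a symbolic expression. All names are hypothetical.

```python
def loop_cost(trip_count, body_cost):
    """Cost of a collapsed loop: symbolic trip count times the body cost."""
    return lambda env: trip_count(env) * body_cost(env)


def condense(chain):
    """Collapse each run of consecutive non-communication nodes into one task.

    chain: list of (kind, cost) in control-flow order.
    The cost of a condensed task is the sum of the costs of its members.
    """
    out = []
    run = []  # costs of the nodes in the run currently being collected

    def flush():
        if run:
            members = list(run)
            out.append(("CONDENSED",
                        lambda env, ms=members: sum(c(env) for c in ms)))
            run.clear()

    for kind, cost in chain:
        if kind == "COMM":
            flush()
            out.append((kind, cost))  # communication tasks are never condensed
        else:
            run.append(cost)
    flush()
    return out
```

For a chain RECV, computation, loop, SEND, the two middle nodes become a single task whose cost is the computation cost plus the loop trip count times the loop body cost.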

In cases with conditional control flow, tasks can sometimes be condensed 
with no loss of accuracy, using sophisticated compiler analysis. For example, no 
accuracy will be lost in cases where all dynamic instances of the resulting col- 
lapsed task have identical computational time; that is, the workload expressions 
are free of any input parameters whose value changes for different instances of 
that task. In other cases, condensing would result in some loss of accuracy, and 
the goals of the modeling study should be used to dictate the degree to which 
tasks are collapsed together. 

To illustrate, consider the code shown in Figure 2(a). Let $w_1$, $w_2$, $w_3$
represent the workload (i.e., the scaling function) for statements S1, S2 and S3,
respectively. The initial version of the STG is shown in Figure 2(b); the nodes
inside the dotted rectangle are candidates for collapsing. Assuming that the
function f in the IF depends on at least one of I or J, we distinguish between
three cases:



- If f is a function of I only, the IF statement can be moved outside the J
  loop and the J loop can be collapsed with no loss of accuracy. In this case,
  we are left with two separate tasks, representing the two possible versions of
  the J loop, as shown in part (c) of the figure. These two tasks have scaling
  functions given by $M \times (w_1 + w_2 + w_3)$ and $M \times (w_1 + w_3)$, respectively.

- If f is a function of J only, the code can be condensed into a single task as
  shown in Figure 2(d). The scaling function of the task T would be $M \times w$,
  where w is the workload inside the J loop body per iteration of the I loop.

- Finally, if f is a function of both I and J, we can condense the code only
  by introducing a branching probability parameter. If $p(I)$ represents the
  probability that S2 will be executed for a given value of I, then the entire
  code inside the dotted rectangle can be condensed into a single task (as in
  part (d)) with a combined scaling function given by
  $M \times (w_1 + p(I) \times w_2 + w_3)$.
  Since this probabilistic expression for execution time can lead to inaccuracies,
  the decision to condense the task graph in such cases should depend on the
  goals of the modeling study.
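
The scaling functions for the three cases can be written out as a small worked example. The Python below is only an arithmetic sketch with hypothetical function names; the workloads and the branch probability are passed in as plain numbers, whereas the compiler keeps them symbolic.

```python
def cost_branch_on_I(M, w1, w2, w3):
    """f depends on I only: two alternative J-loop tasks (branch taken, not taken)."""
    return M * (w1 + w2 + w3), M * (w1 + w3)


def cost_branch_on_J(M, w):
    """f depends on J only: one task with cost M * w, w being the per-iteration
    workload of the J loop body as defined in the text."""
    return M * w


def cost_branch_on_I_and_J(M, w1, w2, w3, p):
    """f depends on I and J: p is the probability that S2 executes for the
    current value of I."""
    return M * (w1 + p * w2 + w3)


if __name__ == "__main__":
    print(cost_branch_on_I(10, 1, 2, 3))
    print(cost_branch_on_I_and_J(10, 1, 2, 3, 0.5))
```

Only the third function involves an estimate: the probability p has to be supplied from outside, for instance from profiling, which is the source of the inaccuracy mentioned above.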

The three cases can be differentiated using well-known but aggressive dataflow 
analysis. We note that the first two cases correspond directly to loop unswitching 
and identifying loop-invariant code respectively, except that only the static task 
graph is modified and the code itself is not transformed. A key point to note is 
that in the first two cases, there is no resulting loss of accuracy in condensing the 
task graph. For example, in the ASCI benchmark Sweep3D used in Section 4,
the one significant branch is in fact of the first type, which can be pulled out 
of the task and enclosing loops (the analysis would have to be interprocedural 
because the enclosing loops are not in the same procedure as the branch). 

3.3 Instantiating the Dynamic Task Graph 

As noted in Section 2, the dynamic task graph (DTG) is essentially an instantiation
of the STG representing a single execution for a particular input. The DTG
is an acyclic graph containing no control-flow nodes. The time for instantiating 
the DTG grows linearly with the number of task instances in the execution of 
the program, but much less computation per task is usually required for the 
instantiation than for the actual execution. This is an optional step that can be 
performed when required for detailed performance prediction. 

The information required to instantiate the DTG varies significantly across 
programs. For a regular, non-adaptive code, the parallel execution behavior of the 
program can usually be determined directly from the program input (in which 
we include the processor configuration parameters). In such cases, the DTG can 
be instantiated directly from the STG once the program input is specified. In 
general, and particularly in adaptive codes, the parallel execution behavior (and 
therefore the DTG) may depend on intermediate computational results of the 
program. For example, this could happen in a parallel n-body problem if the 
communication pattern changed as the positions of the bodies evolved during 
the execution of the program. In the current work, we focus on the techniques 
needed to instantiate the DTG in the former case, i.e., that of regular non- 
adaptive codes. These techniques are also valuable in the latter case, but they 
must be applied at runtime when the intermediate values needed are known. 
The issues to be faced in that case are briefly discussed later in this section. 

There are two main aspects to instantiating the DTG: (1) enumerating the 
outcomes of all the control flow nodes, effectively by unrolling the DO nodes and
resolving the dynamic instances of the branch nodes; and (2) enumerating the
dynamic instances of each node and edge in the STG. These are discussed in turn 
below. Of these, the second step is significantly more challenging in terms of the 
compile-time techniques required, particularly for sophisticated message passing 
programs with general computation partitioning strategies and communication 
patterns. 

Interpreting control-flow in the static task graph: Enumerating the outcomes of
all the control flow nodes in an execution can be accomplished by a symbolic 
interpretation of the control flow of the program for each process. First, we must 
enumerate loop index values and resolve the dynamic instances of branch con- 
ditions that have not been collapsed in the STG. This requires evaluating the 
values of these symbolic expressions. We can perform this evaluation directly 
at compile-time when these quantities are determined solely by input values, 
surrounding loop index variables, and processor id variables. Under these con- 
ditions, we know all the required variable values in the expressions, as follows. 
The input variable values are specified externally. The loop index values are ex- 
plicitly enumerated for all DO nodes that are retained in the static task graph. 
The processor id variables are explicitly enumerated for each parallel task using 
the symbolic processor sets, as discussed below. Therefore, we can evaluate the 
relevant symbolic expressions for enumerating the control-flow outcomes. 
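
The interpretation step can be sketched as a recursive walk over a nested representation of the static task graph. This is a minimal illustration under stated assumptions: the node encoding (tuples tagged `"TASK"`, `"DO"`, `"IF"`), the function names, and the example graph are all hypothetical, and loop bounds and branch conditions are Python functions of the known variable values rather than symbolic expressions.

```python
def instantiate(nodes, env):
    """Yield (task_name, env_snapshot) for every dynamic task instance.

    env holds the input values, the processor id, and the indices of the
    enclosing loops, which is all that is needed to evaluate bounds and
    branch conditions in the regular, non-adaptive case.
    """
    for node in nodes:
        kind = node[0]
        if kind == "TASK":
            yield node[1], dict(env)
        elif kind == "DO":
            _, var, lower, upper, body = node
            for value in range(lower(env), upper(env) + 1):  # unroll the loop
                yield from instantiate(body, {**env, var: value})
        elif kind == "IF":
            _, cond, body = node
            if cond(env):  # resolve the branch
                yield from instantiate(body, env)
        else:
            raise ValueError(f"unknown node kind: {kind}")


def example_stg():
    """A simplified static graph loosely modeled on the loop in Fig. 1(b)."""
    return [
        ("DO", "I", lambda e: 2, lambda e: e["N"], [
            ("IF", lambda e: e["myid"] < e["P"] - 1, [("TASK", "irecv")]),
            ("TASK", "local-compute"),
            ("IF", lambda e: e["myid"] > 0, [("TASK", "isend")]),
        ]),
    ]


def task_names(myid, P, N):
    """Names of the dynamic task instances executed by processor myid."""
    env = {"myid": myid, "P": P, "N": N}
    return [name for name, _ in instantiate(example_stg(), env)]
```

Calling `task_names` once per processor id yields the per-process sequences of dynamic tasks; the result contains no control-flow nodes, as required of the DTG.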

We assumed above that key symbolic quantities were determined solely by 
input values, surrounding loop index variables, and processor id variables. These 
requirements only apply to those loop bounds and branch conditions that are re- 
tained in the collapsed static task graph (i.e., which affect the parallel task graph 
structure of the code), and not to loops and branches that have been collapsed 
because they only affect the internal computational results of a task. With the 
exception of a few common algorithmic constructs, we find these requirements to 
be satisfied by a fairly large class of regular scientific applications. For example, 
in a collection of codes including the three NAS application benchmarks (SP, 
BT and LU), the ASCI benchmark Sweep3D, and other standard data-parallel
codes such as Erlebacher and the SPEC benchmark Tomcatv, the sole exceptions
were terminating conditions testing convergence in the outermost timestep
loops. In such cases, we would rely on the user to specify a fixed number of time 
steps for which the program performance would be modeled. 

More generally, and particularly for adaptive codes, we expect the parallel 
structure to depend on intermediate computational results. This would require 
generating the DTG on the fly, e.g., when performing program-driven simulation 
during which the actual computational results would be computed. In this case, 
the most efficient approach to synthesizing the DTG would be to use program 
slicing to isolate the computations that do affect the parallel control flow. (This 
is very similar to the use of slicing for optimizing parallel simulation as described 
in Section 4.) These extensions are outside the scope of this paper.

Enumerating the symbolic sets and mappings: The second challenge is that we
must enumerate all instances of each parallel task and each communication edge
between tasks. These instances are described by symbolic sets and mappings
respectively. In the context of complex computation partitioning strategies and 
arbitrary communication patterns, this presents a much more difficult problem 
at compile-time (i.e., without executing the actual program). 



CHPF$ distribute rsd(*,block,block,*)
CHPF$ distribute u(*,block,block,*)
CHPF$ distribute flux(*,block,block,*)
      do k = 2, nz-1
        do j = jst, jend
CHPF$ INDEPENDENT, NEW(flux)
          do indep_dummy_loop = 1, 1
            do i = 1, nx
              do m = 1, 5    ! ON HOME rsd(m,i-1,j,k) ∪ rsd(m,i+1,j,k)
                flux(m,i,j,k) = F( u(m,i,j,k) )
              enddo
            enddo
            do i = ist, iend
              do m = 1, 5    ! ON HOME rsd(m,i,j,k)
                rsd(m,i,j,k) = G( flux(m,i-1,j,k), flux(m,i+1,j,k) )
              enddo
            enddo
          enddo
        enddo
      enddo

(a) Source code and computation partitioning



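$\mathit{ProcsThatSend} = \{[p_0,p_1] : jst \le 17p_1+17,\ jend,\ 65 \wedge 0 \le p_0 \le 3 \wedge 0 \le p_1 \le 3$
    $\wedge\ 17p_0 < nx \wedge 17p_1 < jend \wedge 3 \le nz\}$
  $\cup\ \{[p_0,p_1] : jst \le 17p_1+17,\ jend,\ 65 \wedge 0 \le p_0 \le 3 \wedge 0 \le p_1 \le 3$
    $\wedge\ 17p_0 < nx \wedge 17p_1 < jend \wedge 3 \le nz \wedge 2 \le nx \wedge 17p_1 < jend\}$

$\mathit{SendToRecvProcsMap} = \{[p_0,p_1] \to [q_0,q_1] : q_1 = p_1 \wedge 0,\ q_0-1 \le p_0 \le 3$
    $\wedge\ jst \le 17p_1+17,\ 65,\ jend \wedge 0 \le p_1 \le 3$
    $\wedge\ 17p_0 < nx \wedge 3 \le nz \wedge 17q_0 < nx \wedge 17p_1 < jend\}$
  $\cup\ \{[p_0,p_1] \to [q_0,q_1] : 0,\ p_0-1 \le q_0 < p_0 \le 3$
    $\wedge\ jst \le 17p_1+17,\ 65,\ jend \wedge 0 \le p_1 \le 3 \wedge 17p_0 < nx$
    $\wedge\ 3 \le nz \wedge 2+17q_0 \le nx \wedge 17p_1 < jend\}$

(b) Processor sets and task mappings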



      subroutine ProcsThatSend(nx, nz, jst, jend)
      integer nx, nz, jst, jend
      if (jend >= 1 && jst <= jend && nz >= 3 && jst <= 65) then
        do p0 = 0, min(intDiv(nx-1,17), 3)
          do p1 = max(intDiv(jst-17+16,17), 0), min(intDiv(jend-1,17), 3)
            ! Emit SEND task for processor (p0,p1)
          enddo
        enddo
      endif
      end

(c) Parameterized code to enumerate the set ProcsThatSend



Fig. 3. Example from NAS LU illustrating processor sets and task mappings for
communication tasks. Problem size: 65 x 65 x 65; Processor configuration: 4 x 4.

For example, consider the loop fragment from the NAS LU benchmark shown
in Figure 3. The compiler automatically chooses a sophisticated computation
partitioning, denoted by the ON HOME descriptors for each statement in the
figure. For example, the ON HOME descriptor for the assignment to flux indicates
that the instance of the statement in iteration (k,j,i,m) should be executed
by the processors that own either of the array elements rsd(m,i-1,j,k) or
rsd(m,i+1,j,k). This means that each boundary iteration of the statement will
be replicated among the two adjacent processors. This replication eliminates the
need for highly expensive inner-loop communication for the privatizable array
flux. Communication is still required for each reference to array u, all of
which are coalesced by the compiler into a single logical communication event. 
The communication pattern is equivalent to two SHIFT operations in opposite 
directions. Part (b) of the figure shows the set of processors that must execute 
the SEND communication task (ProcsThatSend), as well as the mapping between
processors executing the SEND and those executing the corresponding
wait-recv (SendToRecvProcsMap). (Note that both these quantities are described
in terms of symbolic integer sets, parameterized by the variables jst,
jend, nx and nz.) Each of these sets combines the information for both SHIFT 
operations. Correctly instantiating the communication tasks and edges for such 
communication patterns in a pattern-driven manner can be difficult and error- 
prone, and would be limited to some predetermined class of patterns that is 
unlikely to include such complex patterns. 

Instead, we develop a novel and general solution to this problem that is based 
on an unusual use of code generation from integer sets. In ordinary compilation, 
dHPF and other advanced parallelizing compilers use code generation from inte- 
ger sets to synthesize loop nests that are executed at runtime, e.g., for a parallel 
loop nest or for packing and unpacking data for communication.
If we could invoke the same capability but execute the generated loop nests at 
compile-time, we could use the synthesized loop nests to enumerate the required 
tasks and edges. 

Implementing this approach, however, proved to be a non-trivial task. Most 
importantly, each of the sets is parameterized by several variables, including 
input variables and loop index variables (e.g., the two sets above are parame- 
terized by jst, jend, nx and nz). This means that the set must be enumerated 
separately for each combination of these variable values that occurs during the 
execution of the original program. We solve this problem as follows. We first 
generate a subroutine for each integer set that we want to enumerate, and make 
the parameters arguments of the subroutine. Then (still at compile-time), we 
compile, link, and invoke this code in a separate process. The desired combina- 
tions of variable values for each node and edge are automatically available when 
interpreting the control-flow of the task graph as described earlier. Therefore, 
during this interpretation, we simply invoke the desired subroutine in the other 
process to enumerate the ids for a node or the id pairs for an edge. 

To illustrate this approach, Figure 3(c) shows the subroutine generated to
enumerate the elements of the set ProcsThatSend described earlier. The loop
nest in this subroutine is generated directly from the symbolic integer set in
part (b) of the figure. This loop nest enumerates the instances of the SEND 
task, which in this case is one task per processor executing the SEND. This 
subroutine is parameterized by jst, jend, nx and nz. In this example, these are 
all unique values determined by the program input. In general, however, these 
could depend on loop index variables of some outer loop and the subroutine has 
to be invoked for each combination of values of its arguments. 
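
For readers who want to experiment with the enumeration, the generated subroutine for ProcsThatSend can be transliterated into Python as below. This is a sketch under stated assumptions: `intDiv` is taken to be floor division, the function returns a list instead of emitting SEND tasks, and the argument values in the usage example (jst = 2, jend = 64) are hypothetical inputs chosen to match the 65 x 65 x 65 problem size. In the approach described above the subroutine is compiled, linked, and invoked in a separate process; here it is simply called directly.

```python
def procs_that_send(nx, nz, jst, jend):
    """Enumerate the processors (p0, p1) that execute the SEND task.

    Mirrors the guard and loop bounds of the generated subroutine for a
    4 x 4 processor grid with a block size of 17.
    """
    procs = []
    if jend >= 1 and jst <= jend and nz >= 3 and jst <= 65:
        p0_hi = min((nx - 1) // 17, 3)
        p1_lo = max((jst - 17 + 16) // 17, 0)
        p1_hi = min((jend - 1) // 17, 3)
        for p0 in range(0, p0_hi + 1):
            for p1 in range(p1_lo, p1_hi + 1):
                procs.append((p0, p1))
    return procs


if __name__ == "__main__":
    senders = procs_that_send(nx=65, nz=65, jst=2, jend=64)
    print(len(senders), senders[:4])
```

When the arguments depend on the index of an enclosing loop, the function would be called once per combination of values encountered while interpreting the control flow of the task graph.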

Overall, the use of code generated from symbolic sets enables us to support 
a broad class of computation partitionings and communication patterns in a 
uniform manner. This approach fits nicely with the core capabilities of advanced 
parallelizing compilers. 



4 Status and Results 

We have successfully implemented the techniques described above in the dHPF 
compiler. We have extended the dHPF compiler to synthesize a static task graph 
for the MPI code generated by dHPF, including the symbolic processor sets and 
mappings for communication tasks and edges, and the scaling functions for loop 
nodes. In computing the condensed static task graph, we collapse all DO-NODEs
or sequences of computational tasks that do not contain any communication or
any IF-NODEs (we would rely on user intervention to collapse IF-NODEs). We
also compute the combined scaling function for the collapsed tasks. 

We have also partially implemented the support to instantiate dynamic task 
graphs at compile-time. In particular, we are able to enumerate the task instances 
and control-flow edges. We also synthesize the code from symbolic integer sets
required to enumerate the edge mappings at compile-time. We do not yet link 
in this code to enumerate the edges at compile-time. 

Because of the aggressive computation partitioning and communication strate- 
gies used by dHPF, capturing the resulting MPI code requires the full generality 
of our task graph representation. This gives us confidence that we can synthe- 
size task graphs for a wide range of explicit message-passing programs as well 
(including all the ones we have examined so far). 

In order to illustrate the size of the static task graph generated and the 
effectiveness of condensing the task graph, Table 1 lists some particulars for
the STG produced by the dHPF compiler for three HPF benchmarks: Tomcatv
(from SPEC92), jacobi (a simple 2D Jacobi iterative PDE solver), and expl
(Livermore Loop #18). The effect of condensing the task graph on reducing
the number of loops (DO-NODE) and computational tasks (COMP-TASK) can be
observed. After condensing, most of the remaining tasks are either IF-NODEs
and dummy nodes (e.g., ENDIF-NODE, etc.) or communication tasks (which
are never condensed), since we opted for a detailed representation of
communication behavior rather than compromise on the accuracy of the
representation. The compiler-generated task graphs for the above can be found at
http://www.cs.man.ac.uk/~rizos/taskgraph/

                                              tomcatv               jacobi                 expl
Lines of source HPF program                     227                    64                   94
Lines of output parallel MPI program           1850                  1156                 3722

                                         1st pass  condensed   1st pass  condensed   1st pass  condensed
Total number of tasks                         247        193        122         83        225        174
# COMM-NODE                                    54         54         36         36        114        114
# DO-NODE                                      18          3         13          1         17          3
# COMP-TASK                                    39         20         16          5         23          6

Table 1. Size of STG for various example codes before and after condensing. 

The most important application of our compiler-synthesized task graphs to 
date has been for improving the state of the art of parallel simulation of message- 
passing programs |n|. Those results are briefly summarized here because they 
provide the best illustration of the correctness and benefits of the compiler- 
synthesized task graphs. This work was performed in collaboration with the 
parallel simulation group at UCLA, using MPI-Sim, a direct-execution parallel 
simulator for MPI programs nm. 

The basic strategy in using the STG for program simulation is to generate an 
abstracted MPI program from the STG where all computation corresponding to 
a computational task is eliminated, except those computations whose results are 
required to determine the control-flow, communication behavior, and task scaling 
functions. We refer to the eliminated computations as redundant computations 
(from the point of view of performance prediction), and we use the task scaling 
functions to generate simple symbolic estimates for their execution time. The 
simulator can avoid simulating the redundant computations, and simply needs 
to advance the simulation clock by an amount equal to the estimated execution 
time of the computation. The simulator can even avoid allocating memory for 
program data that is referenced only in redundant computations. 

The key to implementing this strategy lies in identifying non-redundant com- 
putations. To do so, we must first identify the values in the program that deter- 
mine program performance. These are exactly those variable values that appear 
in the control-flow expressions, communication descriptors, and scaling functions 
(both for task times and for communication volume) of the STG. Thus, using 
the STG makes these values very easy to identify. We can then use a standard 
program slicing algorithm HS| to isolate the computations that affect these val- 
ues. We then generate the simplified MPI code by including all the control-flow 
that appears in the static task graph, all the communication calls, and the non- 
redundant computations identified by program slicing. All the remaining (i.e., 
redundant) computations in each computational task are replaced with a single 
call to a special simulator delay function which simply advances the simulator 
clock by a specified amount. The argument to this function is a symbolic expres- 
sion for the estimated execution time of the redundant computation. Note that 
the simulator continues to simulate the communication behavior in detail. 
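
As a rough illustration of the replacement step, the sketch below shows what an abstracted computational task might look like once its loop nest has been removed. Everything in it is an assumption made for the example: `SimProcess`, `sim_delay`, `task_time` and the block-row scaling function are hypothetical stand-ins, not the MPI-Sim interface or the scaling functions produced by dHPF.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical simulator state: one virtual clock per simulated process. */
typedef struct {
    double clock;   /* simulated time in seconds */
} SimProcess;

/* Stand-in for the simulator delay function: it only advances the clock. */
void sim_delay(SimProcess *p, double seconds)
{
    assert(p != NULL);
    if (seconds > 0.0)
        p->clock += seconds;
}

/* Assumed scaling function for one condensed task: an n x n grid split into
 * blocks of rows, costing secs_per_iter per grid point. */
double task_time(long n, int nprocs, double secs_per_iter)
{
    long local_rows = (n + nprocs - 1) / nprocs;   /* ceiling division */
    return (double)local_rows * (double)n * secs_per_iter;
}

/* Abstracted task body: the redundant computation is gone and only its
 * estimated execution time is charged to the simulated process. */
void abstracted_task(SimProcess *p, long n, int nprocs, double secs_per_iter)
{
    sim_delay(p, task_time(n, nprocs, secs_per_iter));
}
```

Communication calls would be left in the abstracted program unchanged, so only the computation between them is replaced by calls of this kind.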

We have applied this methodology both to HPF programs (compiled to MPI 
by the dHPF compiler), and also to existing MPI programs (in the latter case,

Benchmark         % Error in prediction vs. measurement
                    #Procs = 4       8      16      32      64
Tomcatv                  -5.44   15.75   11.79    8.50    9.27
Sweep3D                  -7.01   -4.97    9.02    9.80    5.13
NAS SP, class A          -2.59           -1.24    7.11    6.10
NAS SP, class C                           0.09  -14.01   -1.58

Table 2. Validation of the compiler-generated task graphs using MPI-Sim. 



generating the abstracted MPI program by hand). The benchmarks include an 
HPF version of Tomcatv, and MPI versions of Sweep3D (a key ASCI bench- 
mark) and NAS SP. Table 2 shows the percentage error in the execution times 
predicted by MPI-Sim using the simplified MPI code, compared with direct pro- 
gram measurement. As can be seen, the error was less than 16% in all cases 
tested, and less than 10% in most of these cases. This is important because the 
simplified MPI program can be thought of as simply an executable represen- 
tation of the static task graph itself. These results show that the task graph 
abstraction very accurately captures the properties of the program that deter- 
mine performance. We believe that the errors observed could be further reduced 
by applying more sophisticated techniques for estimating the execution time of 
redundant computations, particularly with simple estimates of cache behavior. 

The benefits of using the task graph based simulation strategy were extremely 
impressive. For these benchmarks, the optimized simulator requires factors of 5 
to 2000 less memory and up to a factor of 10 less time to execute than the 
original simulator. These dramatic savings allow us to simulate systems and 
problem sizes 10 to 100 times larger than is possible with the original simulator. 
Also, they have allowed us to simulate MPI programs for parallel architectures 
with hundreds of processors faster than real-time, and have made it feasible to 
simulate execution of programs on machines with 10,000+ processors. These 
results are described in more detail in jS]. 

5 Related Work 

There is a large body of work on the use of task graphs for various aspects of par- 
allel systems but very little work on synthesizing task graphs for general-purpose 
parallel programs. The vast majority of performance models that use task graphs 
as inputs generally do not specify how the task graph should be constructed but 
assume that this has been done prrairmiEH|i77j. The various compiler-based 
systems that use task graphs, namely PYRROS |2H], CODE HENCE |21j, 
and Jade )2Uj construct task graphs by assuming special information from the 
programmer. In particular, PYRROS, CODE and HENCE all assume that the 
programmer specifies the task graph explicitly (CODE and HENCE actually 
use a graphical programming language to do so). In Jade, the programmer spec- 
ifies input and output variables used by each task and the compiler uses this 
information to deduce the task dependences for the program. 

The PlusPYR project ^3 has developed a task graph representation that has 
some similarities with ours (in particular, symbolic integer sets and mappings for 
describing task instances and communication and synchronization rules), along 
with compiler techniques to synthesize these task graphs. The key difference 
from our work is that they start with a limited class of sequential programs 
(annotated to identify the tasks) and use dependence analysis to compute de- 
pendences between tasks, and then derive communication and synchronization 
rules from these dependences. Therefore, their approach is essentially a form of 
simple automatic parallelization. In contrast, our goal is to generate task graphs 
for existing parallel programs with no special program annotations and with 
explicit communication. A second major difference is that they assume a simple 
parallel execution model in which a task receives all inputs from other tasks in 
parallel and sends all outputs to other tasks in parallel. In contrast, we capture 
much more general communication behavior in order to support realistic HPF 
and MPI programs. 

Parashar et al. m construct task graphs for HPF programs compiled by the 
Syracuse Fortran 90D compiler, but they are limited to a very simple, loosely 
synchronous computational model that would not support many message-passing 
and HPF programs in practice. In addition, their interpretive framework for per- 
formance prediction uses functional interpretation for instantiating a dynamic 
task graph, which is similar to our approach for instantiating control-flow. Like 
the task graph model, their interpretation and performance estimation are sig- 
nificantly simplified (compared with ours) because of the loosely synchronous 
computational model. For example, they do not need to capture sophisticated 
communication patterns and computation partitionings, as we do using code 
generation from integer sets. 

Dikaiakos et al. US] developed a tool called FAST that constructs task graphs 
from user-annotated parallel programs, performs advanced task scheduling and 
then uses abstract simulation of message passing to predict performance. The 
PACE project [2S| proposes a language and programming environment for par- 
allel program performance prediction. Users are required to identify parallel 
subtasks and computation and communication patterns. Finally, Fahringer HSI, 
Armstrong and Eigenmann |0| , Mendes and Reed m and many others have de- 
veloped symbolic compile-time techniques for estimating execution time, com- 
munication volume and other metrics. The communication and computation 
scaling functions available in our static task graph are very similar to the sym- 
bolic information used by these techniques, and could be directly extended to 
support their analytical models. 

6 Conclusion and Future Plans 

In this paper, we described a methodology for automating the process of synthe- 
sizing task graphs for parallel programs, using sophisticated parallelizing com- 
piler techniques. The techniques in this paper can be used without user inter- 
vention to construct task graphs for message-passing programs compiled from HPF 
source programs, and we believe they extend directly to existing message-passing 
(e.g., MPI) programs as well. Such techniques can make a large body of existing 
research based on task graphs and equivalent representations applicable for these 
widely used programming standards. Our immediate goals for the future are: (1) 
to demonstrate that the techniques described in this paper can be applied to 
message-passing programs (using MPI), by extracting the requisite computation 
partitioning and communication information; and (2) to couple the compiler- 
generated task graphs with the wide range of modeling approaches being used 
within the POEMS project, including analytical, simulation and hybrid models. 

Abstract. This paper describes how the use of software libraries, which is preva- 
lent in high performance computing, can benefit from compiler optimizations in 
much the same way that conventional programming languages do. We explain 
how the compilation of these informal languages differs from the compilation 
of more conventional languages. In particular, such compilation requires precise 
pointer analysis, domain-specific information about the library’s semantics, and a 
configurable compilation scheme. We describe a solution that combines dataflow 
analysis and pattern matching to perform configurable optimizations. 



1 Introduction 

High performance computing, and scientific computing in particular, relies heavily on 
software libraries. Libraries are attractive because they provide an easy mechanism for 
reusing code. Moreover, each library typically encapsulates a particular domain of ex- 
pertise, such as graphics or linear algebra, and the use of such libraries allows pro- 
grammers to think at a higher level of abstraction. In many ways, libraries are infor- 
mal domain-specific languages whose only syntactic construct is the procedure call. 
This procedural interface is significant because it couches these informal languages in 
a familiar form without imposing new syntax. Unfortunately, libraries are not viewed 
as languages by compilers. With few exceptions, compilers treat each invocation of a 
library routine the same as any other procedure call. Thus, many optimization opportu- 
nities are lost because the semantics of these informal languages are ignored. 

As a trivial example, an invocation of the C standard math library’s exponentia- 
tion function, pow(a, b), can be simplified to 1 when its second argument is 0. This 
paper argues that there are many such opportunities for optimization, if only compil- 
ers could be made aware of a library’s semantics. These optimizations, which we term 
library-level optimizations, include choosing specialized library routines in favor of 
more general ones, eliminating unnecessary library calls, moving library calls around, 
and customizing the implementation of a library routine for a particular call site. 

Figure 1 shows our system architecture for performing library-level optimiza- 
tions in. In this approach, annotations capture semantic information about library 
routines. These annotations are provided by a library expert and placed in a separate 
file from the source code. This information is read by our compiler, dubbed the Broad- 
way compiler, which performs source-to-source optimizations of both the library and 
application code. 

* This work was supported in part by NSF CAREER Grant ACI-9984660, DARPA Contract 
#F30602-97-1-0150 from the US Air Force Research Labs, and an Intel Fellowship. 

Fig. 1. Architecture of the Broadway Compiler system 



This system architecture offers three practical benefits. First, because the annota- 
tions are separate from the library source, our approach applies to existing libraries and 
existing applications. Second, the annotations describe the library, not the application, 
so the application programmer does nothing more than use the Broadway Compiler 
in place of a standard C compiler. Finally, the non-trivial cost of writing the library 
annotations can be amortized over many applications. 

This architecture also provides an important conceptual benefit: a clean separation 
of concerns. The compiler encapsulates all compiler analysis and optimization ma- 
chinery, while the annotations describe all library knowledge and domain expertise. 
Together, the annotations and compiler free the applications programmer to focus on 
application design rather than on performing manual library-level optimizations. 

The annotation language faces two competing goals. To provide simplicity, the lan- 
guage needs to have a small set of useful constructs that apply to a wide range of soft- 
ware libraries. At the same time, to provide power, the language has to convey sufficient 
information for the Broadway compiler to perform a wide range of optimizations. The 
remainder of this paper will focus on the annotation language and its requirements. 

This paper makes the following contributions. (1) We define the distinguishing char- 
acteristics of library-level optimizations. (2) We describe the implications of these char- 
acteristics for implementing library-level optimizations in a compiler. (3) We present 
formulations of dataflow analysis and pattern matching that address these implications. 
(4) We extend our earlier annotation language Cl to support the configuration of the 
dataflow analyzer and pattern matcher. 



2 Opportunities 

This section characterizes library-level optimizations and their requirements. 

Conceptually similar to traditional optimizations. Library-level optimizations are con- 
ceptually similar to traditional optimizations, which can be grouped into the following 
classes of optimizations. (1) Eliminate redundant computations: Examples include par- 
tial redundancy elimination, common subexpression elimination, loop-invariant code 
motion, and value numbering. (2) Perform computations at compile-time: This may be 
as simple as constant folding or as complex as partial evaluation. (3) Exploit special 
cases: Examples include algebraic identities and simplifications, as well as strength re- 
duction. (4) Schedule code: For example, exploit non-blocking loads and asynchronous 
I/O operations to hide the cost of long-latency operations. (5) Enable other improve- 
ments: Use transformations such as inlining, cloning, loop transformations, and lower- 
ing the internal representation. These same categories form the basis of our library-level 
optimization strategy. In some cases, the library-level optimizations are identical to their 
classical counterparts. In other cases we need to reformulate the optimizations for the 
unique requirements of libraries. 

Significant opportunities exist. Most libraries receive no support from traditional op- 
timizers. For example, the code fragments in Figure 2 illustrate the untapped opportu- 
nities of the standard C Math Library. A conventional C compiler will perform three 
optimizations on the built-in operators: (1) strength reduction on the computation of 
d1, (2) loop-invariant code motion on the computation of d3, and (3) replacement of 
division with bitwise right-shift in the computation of int1. The resulting optimized 
code is shown in the middle fragment. However, there are three analogous optimiza- 
tion opportunities on the math library computations that a conventional compiler will 
not discover: (4) strength reduction of the power operator in the computation of d2, (5) 
loop-invariant code motion on the cosine operator in the computation of d4, and (6) 
replacement of sine divided by cosine with tangent in the computation of d5. The code 
fragment on the right shows the result of applying these optimizations. 



Original Code

    for (i=1; i<=N; i++) {
      d1 = 2.0*i;
      d2 = pow(x, i);
      d3 = 1.0/z;
      d4 = cos(z);
      int1 = i/4;
      d5 = sin(y)/cos(y);
    }

Conventional

    d1 = 0.0;
    d3 = 1.0/z;
    for (i=1; i<=N; i++) {
      d1 += 2.0;
      d2 = pow(x, i);
      d4 = cos(z);
      int1 = i >> 2;
      d5 = sin(y)/cos(y);
    }

Library-level

    d1 = 0.0;
    d2 = 1.0;
    d3 = 1.0/z;
    d4 = cos(z);
    d5 = tan(y);
    for (i=1; i<=N; i++) {
      d1 += 2.0;
      d2 *= x;
      int1 = i >> 2;
    }



Fig. 2. A conventional compiler optimizes built-in math operators, but not math library 
operators. 



Significantly, each application of a library-level optimization is likely to yield much 
greater performance improvement than the analogous conventional optimization. For 
example, removing an unnecessary multiplication may save a few cycles, while remov- 
ing an unnecessary cosine computation may save hundreds or thousands of cycles. 

Specialized routines are difficult to use. Many libraries provide a basic interface which 
provides basic functionality, along with an advanced interface that provides special- 
ized routines that are more efficient in certain circumstances HI- For example, the 
MPI message passing interface [O provides 12 variations of the basic point-to-point 
communication operation. These advanced routines are typically more difficult to use 
than the basic versions. For example, MPI’s Ready Send routines assume that the com- 
municating processes have already been somehow synchronized. These specialized rou- 
tines represent an opportunity for library-level optimization, as a compiler would ideally 
translate invocations of basic routines to specialized routines. 

Domain-specific analysis is required. Most libraries provide abstractions that can be 
useful for performing optimizations. For example, the PLAPACK parallel linear algebra 
library mi manipulates linear algebra objects indirectly through handles called views. 
A view consists of data, possibly distributed across processors, and an index range that 
selects some of the data. While most PLAPACK procedures are designed to accept 
any type of view, the actual parameters often have special distributions. Recognizing 
and exploiting these special distributions can yield significant performance gains Q. 
For example, in some cases, calls to the general-purpose PLA_Trsm() routine can be 
replaced with calls to a specialized routine, PLA_Trsm_Local(), which assumes the 
matrix view resides completely on a single processor. This customized routine can run 
as much as three times faster Q. 

The key to this optimization is to analyze the program to discover the special case 
matrix distributions. While compilers can perform many kinds of dataflow analysis, 
most compilers have no notion of “matrix,” let alone PLAPACK’s particular notion 
of matrix distributions. Thus, to perform this kind of optimization, there must be a 
mechanism for telling the compiler about the relevant abstractions and for facilitating 
program analysis in those terms. 

Challenges. To summarize, while there are many opportunities for library-level opti- 
mization, there are also significant challenges. First, while library-level optimizations 
are conceptually similar to traditional optimizations, library routines are typically more 
complex than primitive language operators. Second, a library typically embodies a 
high-level domain of computation whose abstractions are not represented in the base 
language, and effective optimizations are often phrased in terms of these abstractions. 
Third, the compiler cannot be hardwired for every possible library because each library 
requires unique analyses and optimizations. Instead, all of these facilities need to be 
configurable. The next two sections address these issues in more detail. 

3 Dependence Analysis 

Almost any kind of program optimization requires a model of dataflow dependences to 
preserve the program’s semantics. The use of pointers, and particularly pointer-based 
data structures, can greatly complicate dependence analysis, as pointers can make it dif- 
ficult to determine which memory objects are actually modified. While many solutions 
to the pointer analysis problem have been proposed, we now argue that optimization at 
the library level requires the most precise and most aggressive of these. 

3.1 Why Libraries Need Pointers 

Libraries use pointers for two main reasons. The first is to overcome the limitations 
of the procedure call mechanism, as pointers allow a procedure to return more than 
one value. The second reason is to build and manipulate complex data structures that 
represent the library’s domain-specific programming abstractions. It is important to un- 
derstand these structures because data dependences may exist between internal compo- 
nents of these data structures which, if violated, could change the program’s behavior. 
As an example, consider the following PLAPACK routine: 

    PLA_Obj_horz_split_2(size, A, &A_upper, &A_lower);

This routine logically splits the matrix into two pieces by returning objects that rep- 
resent ranges of the original matrix index space. Internally, the library defines a view 
data structure that consists of minimum and maximum values for the row and column 
indices, and a pointer to the actual matrix data. To see how this complicates analysis, 
consider the following code fragment: 

    PLA_Obj A, A_upper, A_lower, B;
    PLA_Create_matrix(num_rows, num_cols, &A);
    PLA_Obj_horz_split_2(size, A, &A_upper, &A_lower);
    B = A_lower;

The first line declares four variables of type PLA_Obj, which is an opaque pointer to a 
view data structure. The second line creates a new matrix (both view and data) of the 
given size. The third line creates two views into the original data by splitting the rows 
into two groups, upper and lower. The fourth line performs a simple assignment of one 
view variable to another. Figure 3 shows the resulting data structures graphically. 



[Diagram: program variables and internal heap objects]




Fig. 3. Library data structures have complex internal structure. 



The shaded objects are never visible to the application code, but accesses and mod- 
ifications to them are still critical to preserving program semantics. For example, re- 
gardless of whether A, A_lower, or B is used, the compiler cannot change the order of 
library calls that update the data. 

3.2 Pointer Analysis for Library-Level Optimizations 

Pointer analysis attempts to determine whether there are multiple ways to access the 
same piece of memory. We categorize the many approaches to pointer analysis along the 
following dimensions: (1) points-to versus alias representation, (2) heap model, and (3) 
flow and context sensitivity. This section describes the characteristics of our compiler’s 
pointer analysis and explains why they are appropriate for library-level optimization. 

Representation. Heap allocation is common for library code because it allows data 
structures to exist throughout the life of the application code and it supports complex 
and dynamic structures. By necessity, components of these structures are connected by 
pointers, so a points-to relation provides a more natural model than alias pairs. Further- 
more, in C the only true variable aliasing occurs between the fields of a union; all other 
aliases occur through pointers and pointer expressions. 

Heap Model. The heap model determines the granularity and naming of heap allocated 
memory. Previous work demonstrates a range of possibilities including (1) one object 
to represent the whole heap 01, (2) one object for each connected data structure O, 
(3) one object for each allocation call site and (4) multiple objects for a malloc call 
site with a fixed limit (“k-limiting”). In the matrix split example above, we need a 
model of the heap that distinguishes the views from the data. Thus, neither of the first 
two approaches is sufficiently precise. We choose approach (3) because it is precise, 
without the unnecessary complexity of the k-limiting approach. 

Often, a library will provide only one or two functions that allocate new data struc- 
tures. For example, the above PLA_matrix_create function creates all matrices in the 
system. Thus, if we associate only one object (or one data structure, e.g., one view and 
one data object) with the call site, we cannot distinguish between individual instances 
of the data structure. In the example, we would have one object to represent all views 
created by the split operation, preventing us from distinguishing between view_2 
and view_3. Therefore, we create a new object in the heap model for each unique 
execution path that leads to the allocation statement. 

This naming system leads to an intuitive heap model where objects in the model 
often represent conceptual categories, such as “the memory allocated by foo().” Note 
that when allocation occurs in a loop, all of the objects created during execution of the 
loop are represented by one object in the heap model. 
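
One way to picture this naming scheme is a table keyed by the chain of call sites that leads to the allocator. The sketch below is an illustration under stated assumptions, not the Broadway implementation: call sites are assumed to be numbered with integers, the bounds `MAX_PATH` and `MAX_OBJS` are arbitrary, and `HeapModel` and `heap_object_for` are hypothetical names.

```c
#include <assert.h>
#include <string.h>

#define MAX_PATH  8    /* assumed bound on call-path depth */
#define MAX_OBJS 64    /* assumed bound on abstract heap objects */

typedef struct {
    int sites[MAX_PATH];   /* call-site ids from the entry point downward */
    int depth;
} CallPath;

typedef struct {
    CallPath paths[MAX_OBJS];
    int count;
} HeapModel;

/* Return the abstract heap object for an allocation reached through the
 * given call path, creating a new object the first time the path is seen.
 * Returns -1 if the path is too deep or the table is full. */
int heap_object_for(HeapModel *h, const int *sites, int depth)
{
    assert(h != NULL && sites != NULL);
    if (depth < 0 || depth > MAX_PATH)
        return -1;
    for (int i = 0; i < h->count; i++) {
        if (h->paths[i].depth == depth &&
            memcmp(h->paths[i].sites, sites,
                   (size_t)depth * sizeof(int)) == 0)
            return i;              /* same path: same abstract object */
    }
    if (h->count == MAX_OBJS)
        return -1;
    memcpy(h->paths[h->count].sites, sites, (size_t)depth * sizeof(int));
    h->paths[h->count].depth = depth;
    return h->count++;
}
```

Two calls that reach the allocator through different call sites get different objects, while repeated executions of the same path, such as iterations of a loop, map to a single object, which matches the behavior described above.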

Context and Flow Sensitivity. Libraries are mechanisms for software reuse, so library 
calls often occur deep in the application’s call graph, with the same library functions 
repeatedly invoked from different locations. Without context and flow sensitivity, the 
analyzer merges information from the different call sites. For example, consider a sim- 
ple PLAPACK program that makes two calls to the split routine with different matrices: 

    PLA_Obj_horz_split_2(size, A, &A_upper, &A_lower);
    PLA_Obj_horz_split_2(size, B, &B_upper, &B_lower);

Context-insensitive analysis concludes that all four outputs might point to either A’s 
data or B’s data. While this information is conservatively correct, it severely limits op- 
timization opportunities by creating an unnecessary data dependence. Any subsequent 
analyses that use this information suffer the same merging of information. For exam- 
ple, if the state of A is unknown, then analysis cannot safely infer that B_upper and 
B_lower contain all zeros, even if analysis concludes that B contains all zeros before 
the split. 
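
The loss of information can be reproduced with a toy analysis. The sketch assumes a two-point lattice for a single matrix property and a transfer function in which both output views simply inherit the property of the input; `MatState`, `join`, `split_output` and the two `analyze_*` functions are hypothetical names chosen for the example, and the sketch models only the merging of a dataflow property, not the points-to sets themselves.

```c
#include <assert.h>

/* Tiny lattice for one matrix property; UNKNOWN carries less information. */
typedef enum { ALL_ZERO = 0, UNKNOWN = 1 } MatState;

MatState join(MatState a, MatState b)
{
    return (a == ALL_ZERO && b == ALL_ZERO) ? ALL_ZERO : UNKNOWN;
}

/* Assumed transfer function of the split routine: the output views
 * inherit the property of the input matrix. */
MatState split_output(MatState input)
{
    return input;
}

/* Context-insensitive: one summary per routine, so the input seen by the
 * analysis is the join over the arguments of every call site. */
void analyze_insensitive(const MatState *args, MatState *outs, int ncalls)
{
    assert(ncalls >= 0);
    MatState merged = ALL_ZERO;              /* identity element of join */
    for (int i = 0; i < ncalls; i++)
        merged = join(merged, args[i]);
    for (int i = 0; i < ncalls; i++)
        outs[i] = split_output(merged);
}

/* Context-sensitive: each call site is analyzed with its own argument. */
void analyze_sensitive(const MatState *args, MatState *outs, int ncalls)
{
    assert(ncalls >= 0);
    for (int i = 0; i < ncalls; i++)
        outs[i] = split_output(args[i]);
}
```

With A unknown and B all zeros, the insensitive version reports UNKNOWN for the views of B as well, whereas the sensitive version keeps ALL_ZERO for them.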

The more reuse that occurs in a system, the more important it is to keep information 
separate. Thus, we implement full context sensitivity. While this approach is complex, 
recent research shows that efficient implementations are possible m- Recent work 
also shows that more precise pointer analysis not only causes subsequent analyses to 
produce more precise results, but also causes them to run faster liT^l 

3.3 Annotations for Dependence Analysis 

Our annotation language provides a mechanism for explicitly describing how a routine 
affects pointer structures and dataflow dependences. This information is integrated into 
the Broadway compiler’s dataflow and pointer analysis framework. The compiler builds 
a uniform representation of dependences, regardless of whether they involve built-in 
operators or library routines. When annotations are available, our compiler reads that 
information directly. For the application and for libraries that have not been annotated, 
our compiler analyzes the source code using the pointer analysis described above. 



    procedure PLA_Obj_horz_split_2(obj, height, upper, lower)
    {
      on_entry { obj --> view_1, DATA of view_1 --> data }
      access   { view_1, height }
      modify   { }
      on_exit  { upper --> new view_2, DATA of view_2 --> data,
                 lower --> new view_3, DATA of view_3 --> data }
    }



Fig. 4. Annotations for pointer structure and dataflow dependences. 



Figure 4 shows the annotations for the PLAPACK matrix split routine. In some 
cases, a compiler could derive such information from the library source code. However, 
there are situations when this is impossible or undesirable. Many libraries encapsulate 
functionality for which no source code is available, such as low-level I/O or interprocess 
communication. Moreover, the annotations allow us to model abstract relationships that 
are not explicitly represented through pointers. For example, a file descriptor logically 
refers to a file on disk through a series of operating system data structures. Our annota- 
tions can explicitly represent the file and its relationship to the descriptor, which might 
make it possible to recognize when two descriptors access the same file. 

Pointer Annotations: on_entry and on_exit. To convey the effects of the procedure 
on pointer-based data structures, the on_entry and on_exit annotations describe the 
pointer configuration before and after execution. Each annotation contains a list of ex- 
pressions of the following form: 

    [ label of ] identifier --> [ new ] identifier

The --> operator, with an optional label, indicates that the object named by the iden- 
tifier on the left logically points to the object named on the right. In the on_entry 
annotations, these expressions describe the state of the incoming arguments and give 
names to the internal objects. In the on_exit annotations, the expressions can create 
new objects (using the new keyword) and alter the relationships among existing objects. 

In Figure 4, the on_entry annotation indicates that the formal parameter obj is 
a pointer and assigns the name view_1 to the target of the pointer. The annotation 
also says that view_1 is a pointer that points to data. The on_exit annotation 
declares that the split procedure creates two new objects, view_2 and view_3. 
The resulting pointer relationships correspond to those in Figure 3. 

Dataflow Annotations: access and modify. The access and modify annotations de- 
clare the objects that the procedure accesses or modifies. These annotations may refer 
to formal parameters or to any of the internal objects introduced by the pointer annota- 
tions. The annotations in Figure 4 show that the procedure uses the height argument 
and reads the input view view_1. In addition, we automatically add the accesses and 
modifications implied by the pointer annotations: a dereference of a pointer is an access, 
and setting a new target is a modification. 

3.4 Implications 

As described in an earlier section, most library-level optimizations require more than just
dependence information. However, simply having the proper dependence analysis information
for library routines does enable some classical optimizations. For example, by
separating accesses from modifications, we can identify and remove library calls that 
are dead code. To perform loop invariant code motion or common subexpression elim- 
ination, the compiler can identify purely functional routines by checking whether the 
objects accessed are different from those that are modified. 
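
The two checks described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the compiler's implementation: the Summary class, the helper names, and the example routines are hypothetical, and the dead-code test assumes that only the listed later calls can observe what a call modifies.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Summary:
    """Hypothetical per-call summary derived from access/modify annotations."""
    name: str
    access: frozenset
    modify: frozenset

def is_purely_functional(s):
    # No object is both read and written by the routine.
    return s.access.isdisjoint(s.modify)

def is_dead(call, later_calls):
    # Assumption: only the listed later calls can observe the modified objects.
    observed = set().union(*(c.access for c in later_calls))
    return call.modify.isdisjoint(observed)

split = Summary("split", frozenset({"view_1", "height"}),
                frozenset({"view_2", "view_3"}))
update = Summary("update", frozenset({"data"}), frozenset({"data"}))
reader = Summary("reader", frozenset({"view_2"}), frozenset({"result"}))
```

Here split qualifies as purely functional because its access and modify sets are disjoint, while update does not. The split call is removable only when no later call reads view_2 or view_3.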

4 Library-Level Optimizations 

Our system performs library-level optimizations by combining two complementary 
tools: dataflow analysis and pattern-based transformations. Patterns can concisely spec- 
ify local syntactic properties and their transformations, but they cannot easily express 
properties beyond a basic block. Dataflow analysis concisely describes global context, 
but cannot easily describe complex patterns. Both tools have a wide range of realiza- 
tions, from simple to complex, and by combining them, each tool is simplified. Both 
tools are configurable, allowing each library to have its own analyses and transforma- 
tions. 

The patterns need not capture complex context-dependent information because such 
information is better expressed by the program analysis framework. Consider the fol- 
lowing fragment from an application program: 

PLA_Obj_horz_split_2( A, size, &A_upper, &A_lower );

if ( is_Local( A_upper ) ) { ... } else { ... }

The use of PLA_Obj_horz_split_2 ensures that A_upper resides locally on a single
processor. Therefore, the condition is always true, and we can simplify the subsequent
if statement by replacing it with the then-branch. The transformation depends on two
conditions that are not captured in the pattern. First, we need to know that the is_Local
function does not have any side-effects. Second, we need to track the library-specific
local property of A_upper through any intervening statements to make sure that it is
not invalidated. It would be awkward to use patterns to express these conditions.

Our program analysis framework fills in the missing capabilities, giving the patterns
access to globally derived information such as data dependences and control-flow in- 
formation. We use abstract interpretation to further extend these capabilities, allowing 
each library to specify its own dataflow analysis problems, such as the matrix distri- 
bution analysis needed above. This approach also keeps the annotation language itself 
simple, making it easier to express optimizations and easier to get them right. The fol- 
lowing sequence outlines our process for library-level optimization: 

1. Pointers and dependences. Analyze the code using combined pointer and dependence
analysis, referring to annotations when available to describe library behavior.

2. Classical optimizations. Apply traditional optimizations that rely on dependence 
information only. 

3. Abstract interpretation. Use the dataflow analysis framework to derive library- 
specific properties as specified by the annotations. 

4. Patterns. Search the program for possible optimization opportunities, which are 
specified in the annotations as syntactic patterns.

5. Enabling conditions. For patterns that match, the annotations can specify addi- 
tional constraints, which are expressed in terms of data dependences and the results 
of the abstract interpretation. 

6. Actions. When a pattern satisfies all the constraints, the specified code transforma- 
tion is performed. 

4.1 Configurable Dataflow Analysis 

This section describes our configurable interprocedural dataflow analysis framework. 
The annotations can be used to specify a new analysis pass by defining both the library- 
specific flow value and the associated transfer functions. For every new analysis pass, 
each library routine is supplied a transfer function that represents its behavior. 

The compiler reads the specification and runs the given problem on our general- 
purpose dataflow framework. The framework takes care of propagating the flow values 
through the flow graph, applying the transfer functions, handling control-flow such as 
conditionals and loops, and testing for convergence. Once complete, it stores the final, 
stable flow values for each program point.
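
The propagation loop described above can be sketched as a small worklist solver. This is a minimal sketch and not the framework itself: the function names, the dictionary-based flow graph, and the use of None for "not yet reached" are assumptions made for illustration, and termination relies on the usual assumption of monotone transfer functions over a lattice of finite height.

```python
BOTTOM = "bottom"  # stands for conflicting values in the number lattice

def number_meet(a, b):
    # Meet for the number type: equal values survive, anything else is bottom.
    return a if a == b else BOTTOM

def solve(cfg, entry, init, transfer, meet):
    """Forward analysis over cfg, a dict mapping each node to its successors.

    Returns the stable outgoing flow value of every node; None marks a node
    that was never reached.
    """
    preds = {n: [] for n in cfg}
    for n, succs in cfg.items():
        for s in succs:
            preds[s].append(n)
    out = {n: None for n in cfg}
    worklist = list(cfg)
    while worklist:
        n = worklist.pop(0)
        incoming = [out[p] for p in preds[n] if out[p] is not None]
        if n == entry:
            incoming.append(init)
        if not incoming:
            continue  # nothing known yet; revisited when a predecessor changes
        value = incoming[0]
        for v in incoming[1:]:
            value = meet(value, v)  # control-flow merge
        new_out = transfer(n, value)
        if new_out != out[n]:  # not yet converged at this node
            out[n] = new_out
            worklist.extend(s for s in cfg[n] if s not in worklist)
    return out

# Diamond-shaped flow graph whose two branches assign different numbers.
cfg = {"entry": ["a", "b"], "a": ["join"], "b": ["join"], "join": []}
assign = {"a": 1, "b": 2}
result = solve(cfg, "entry", 0, lambda n, v: assign.get(n, v), number_meet)
```

In this example the two branches disagree, so the value at the join node falls to bottom. If both branches assigned the same number, that number would survive the merge.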

While other configurable program analyzers exist, ours is tailored specifically for 
library-level optimization. First, we would like library experts, not compiler experts,
to be able to define their own analyses. Therefore, the specification of new analysis 
problems is designed to be simple and intuitive. Second, we do not intend to support 
every possible analysis problem. The annotation language provides a small set of flow 
value types and operators, which can be combined to solve many useful problems. The 
lattices implied by these types have predefined meet functions, allowing us to hide the 
underlying lattice theory from the annotator. 

For library-level optimization, the most useful analyses seem to fall into three rough 
categories: (1) analyze the objects used by the program and classify them into library- 
specific categories, (2) track relationships among those objects, and (3) represent the 
overall state of the computation. The PLAPACK distribution analysis is an instance of
the first kind. To support these different kinds of analysis, we propose a simple type 
system for defining flow values. 

Flow Value Types. Our flow value type system consists of primitive types and simple 
container types. A flow value is formed by combining a container type with one or two 
of the primitive types. Each flow value type includes a predefined meet function, which
defines how instances of the type are combined at control-flow merge points. The other 
operations are used to define the transfer functions for each library routine. 

number The number type supports basic arithmetic computations and comparisons. 
The meet function is a simple comparison: if the two numbers are not equal, then 
the result is lattice bottom. 

object The object type represents any memory location, including global and stack 
variables, and heap allocated objects. The only operation supported tests whether 
two expressions refer to the same object. 

statement The statement type refers to points in the program. We can use this type to 
record where computations take place. Operations include tests for equality, domi- 
nance and dependences. 

category Categories are user-defined enumerated types that support hierarchies. For 
example, we can define a Vehicle categorization like this: 

{ Land { Car, Truck { Pickup, Semi } }, Water { Boat, Submarine } }

The meet function chooses the most specific category that includes the two given 
values. For example, the meet of Pickup and Semi yields Truck, while the meet of 
Pickup and Submarine yields lattice bottom. The resulting lattice forms a simple 
tree structure, as shown in Figure 5.

Fig. 5. The lattice induced by the example vehicle categories.



Operations on the categories include testing for equality and for category member- 
ship. For example, we may want to know whether something is a Truck without 
caring about its specific subtype. 

set-of<T> Set is the simplest container: it holds any number of instances of the primitive
type T. We can add and remove elements from the set, and test for membership.
We support two possible meet functions for sets: set union for “optimistic” analyses,
and set intersection for “pessimistic” analyses. It is up to the annotator to
decide which is more appropriate for the specific analysis problem.

equivalence-of<T> This container maintains an equivalence relation over the elements
that it contains. Operations on this container include adding pairs of elements
to indicate that they are equivalent, and removing individual elements. The basic
query operator tests whether two elements are equivalent by applying the transitive
closure over the pairs that have been added. As with the set container, there are two
possible meet functions, one optimistic and one pessimistic, that correspond to the
union and intersection of the two relations.

ordering-of<T> The ordering container maintains a partial order over the elements 
it contains. We add elements in pairs, smaller element and larger element, to in- 
dicate ordering constraints. We can also remove elements. Like the equivalence 
container, the ordering container allows ordering queries on the elements it con- 
tains. In addition, it ensures that the relation remains antisymmetric by removing 
cycles completely. 

map-of<K,V> The map container maintains a mapping between elements of the two 
given types. The first type is the “key” and may only have one instance of a partic- 
ular value in the map at a time. It is associated with a “value.” 
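
The meet function for the category type described above can be sketched as a walk up the category tree. This is an illustrative sketch only: the nested-dictionary encoding and the helper names are assumptions, and Top is treated as the identity of the meet.

```python
TOP, BOTTOM = "Top", "Bottom"

# The Vehicle categorization from the text, encoded as nested dictionaries.
VEHICLE = {"Land": {"Car": {}, "Truck": {"Pickup": {}, "Semi": {}}},
           "Water": {"Boat": {}, "Submarine": {}}}

def parent_map(tree, parent=BOTTOM, parents=None):
    """Flatten nested categories into a child -> enclosing-category map."""
    parents = {} if parents is None else parents
    for name, children in tree.items():
        parents[name] = parent
        parent_map(children, name, parents)
    return parents

def chain(cat, parents):
    """The category followed by every enclosing category, ending at Bottom."""
    result = [cat]
    while result[-1] != BOTTOM:
        result.append(parents[result[-1]])
    return result

def meet(a, b, parents):
    # Most specific category that includes both values.
    if a == TOP:
        return b
    if b == TOP:
        return a
    enclosing_b = set(chain(b, parents))
    return next(c for c in chain(a, parents) if c in enclosing_b)

def is_a(cat, group, parents):
    # Category membership test, e.g. "is this some kind of Truck?"
    return group in chain(cat, parents)

PARENTS = parent_map(VEHICLE)
```

With this encoding, the meet of Pickup and Semi is Truck, and the meet of Pickup and Submarine is lattice bottom, matching the examples in the text.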

We can model many interesting analysis problems using these simple types. The an- 
notations define an analysis using the property keyword, followed by a name and then 
a flow value type expression. Figure 6 shows some example property definitions. The
first one describes the flow value for the PLAPACK matrix distribution analysis. The
equivalence Aligned could be used to determine when matrices are suitably aligned
on the processors. The partial order SubMatrixOf could maintain the relative sizes of
matrix views. The last example could be used for MPI to keep track of the asynchronous 
messages that are potentially “in flight” at any given point in the program. 



property Distribution :
    map-of < object, { General { RowPanel, ColPanel, Local },
                       Vector,
                       Empty } >

property Aligned : pessimistic equivalence-of < object >

property SubMatrixOf : ordering-of < object >

property MessagesInFlight : optimistic set-of < object >



Fig. 6. Examples of the property annotation for defining flow values. 
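
To make the behavior of a property such as Aligned more concrete, the following sketch models an equivalence-of container with both meet functions. It is an assumption-laden illustration rather than the Broadway implementation: the class and function names are hypothetical, element removal is omitted, and the optimistic meet takes the transitive closure of the union of the two relations.

```python
class Equivalence:
    """Equivalence relation stored as a union-find forest."""

    def __init__(self, pairs=()):
        self.parent = {}
        for a, b in pairs:
            self.add(a, b)

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def add(self, a, b):
        # Record that a and b are equivalent.
        self.parent[self.find(a)] = self.find(b)

    def equivalent(self, a, b):
        if a == b:
            return True
        if a not in self.parent or b not in self.parent:
            return False
        return self.find(a) == self.find(b)

    def classes(self):
        groups = {}
        for x in list(self.parent):
            groups.setdefault(self.find(x), set()).add(x)
        return list(groups.values())

def _from_classes(classes):
    result = Equivalence()
    for cls in classes:
        members = list(cls)
        for x in members[1:]:
            result.add(members[0], x)
    return result

def pessimistic_meet(e1, e2):
    # Intersection: keep only equivalences that hold on both incoming paths.
    return _from_classes(c1 & c2 for c1 in e1.classes() for c2 in e2.classes())

def optimistic_meet(e1, e2):
    # Union (closed transitively): keep equivalences that hold on either path.
    return _from_classes(e1.classes() + e2.classes())
```

For a property declared pessimistic, two matrices remain aligned after a control-flow merge only if they were aligned on every incoming path.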



Transfer Functions. For each analysis problem, transfer functions summarize the ef- 
fects of library routines on the flow values. Transfer functions are specified as a case 
analysis, where each case consists of a condition, which tests the incoming flow values,
and a consequence, which sets the outgoing flow value. Both the conditions and the
consequences are written in terms of the functions available on the flow value type. 



procedure PLA_Obj_horz_split_2 ( obj, height, upper, lower )
{
  on_entry { obj --> view_1, DATA of view_1 --> data }
  access { view_1, height }
  modify { }

  analyze Distribution {
    ( view_1 == General )  => view_2 = RowPanel, view_3 = General;
    ( view_1 == RowPanel ) => view_2 = RowPanel, view_3 = RowPanel;
    ( view_1 == ColPanel ) => view_2 = Local,    view_3 = ColPanel;
    ( view_1 == Local )    => view_2 = Local,    view_3 = Empty;
  }

  on_exit { upper --> new view_2, DATA of view_2 --> data,
            lower --> new view_3, DATA of view_3 --> data }
}



Fig. 7. Annotations for matrix distribution analysis. 



Figure 7 shows the annotations for the PLAPACK routine
PLA_Obj_horz_split_2, including those that define the matrix distribution
transfer function. The analyze keyword indicates the property to which the transfer
function applies. We integrate the transfer function with the dependence annotations
because we need to refer to the underlying structures. Distribution is a property of the
views (see Section 3), not the surface variables. Notice the last case: if we deduce that
a particular view is Empty, we can remove any code that computes on that view. 
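
The case analysis of the Distribution transfer function can be read as a lookup table. The sketch below mirrors the four cases of the annotation; the function name and the dictionary representation of the map-of flow value are hypothetical, and leaving the new views unset when no case matches is an assumption, since the annotation does not say what happens then.

```python
# Incoming distribution of view_1 -> (distribution of view_2, of view_3).
SPLIT_CASES = {
    "General":  ("RowPanel", "General"),
    "RowPanel": ("RowPanel", "RowPanel"),
    "ColPanel": ("Local", "ColPanel"),
    "Local":    ("Local", "Empty"),
}

def horz_split_transfer(dist, view_1, view_2, view_3):
    """Return the outgoing Distribution map (object -> category)."""
    out = dict(dist)  # flow values are not modified in place
    case = SPLIT_CASES.get(dist.get(view_1))
    if case is not None:
        out[view_2], out[view_3] = case
    return out

before = {"A_view": "Local"}
after = horz_split_transfer(before, "A_view", "upper_view", "lower_view")
```

Splitting a Local view yields a Local upper view and an Empty lower view, which is the case that lets later steps remove computation on the empty view.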



4.2 Pattern-Based Transformations 

The library-level optimizations themselves are best expressed using pattern-based trans- 
formations. Once the dataflow analyzer has collected whole-program information, 
many optimizations consist of identifying and modifying localized code fragments. 
Patterns provide an intuitive and configurable way to describe these code fragments. 
In PLAPACK, for example, we use the results of the matrix distribution analysis to 
replace individual library calls with specialized versions where possible. 

Pattern-based transformations need to identify sequences of library calls, to check 
the call site against the dataflow analysis results, and to make modifications to the code. 
Thus, the annotations for pattern-based transformations consist of three parts: a pattern, 
which describes the target code fragment, preconditions, which must be satisfied, and 
an action, which specifies modifications to the code. The pattern is simply a code frag- 
ment that acts as a template, with special meta-variables that behave as “wildcards.” 
The preconditions perform additional tests on the matching application code, such as 
checking data dependences and control-flow context, and looking up dataflow analysis 
results. The actions can specify several different kinds of code transformations, includ- 
ing moving, removing, or substituting the matching code. 

Patterns. The pattern consists of a C code fragment with meta-variables that bind to
different components in the matching application code. Our design is influenced by the
issues raised in an earlier section. Typical code pattern matchers work with expressions and
rely on the tree structure of expressions to identify computations. However, the use of 
pointers and pointer-based data structures in the library interface presents a number of 
complications, and forces us to take a different approach. 

The parameter passing conventions used by libraries have several consequences for 
pattern matching. First, the absence of a functional interface means that a pattern cannot 
be represented as an expression tree; instead, patterns consist of a sequence of state- 
ments with data dependences among them. Second, the use of the address operator 
to emulate pass-by-reference semantics obscures those data dependences. Finally, the 
pattern instance in the application code may contain intervening, but computationally 
irrelevant statements. Figure 8 depicts some of the possible complications by showing
what would happen if the standard math library did not have a functional interface. 



Basic Pattern

    y = sin(x)/cos(x);
        ►
    y = tan(x);

Non-functional Interface

    sin(x, &t0);
    cos(x, &t1);
    y = t0/t1;
        ►
    tan(x, &y);

Access Complications

    p = &x;
    sin(x, &t0);
    cos(*p, &t1);
    y = t0/t1;
        ►
    tan(*p, &y);



Fig. 8. The use of pointers for parameter passing complicates pattern matching. 



To address these problems, we offer two meta-variable types, one that matches objects
(both variables and heap-allocated memory), and one that matches constants. The
object meta-variable ignores the different ways that objects are accessed. For example,
in the third code fragment in Figure 8, the same meta-variable would match both x
and *p. The constant meta-variable can match a literal constant in the code, a constant 
expression, or the value of a variable if its value can be determined at compile time. 

For a pattern to match, the application code must contain the specified sequence of 
statements, respecting any data dependences implied by the meta-variable names. The 
matching sequence may contain intervening statements, as long as those statements 
have no dependences with that sequence. We would like to weaken this restriction in 
the future, but doing so raises some difficult issues for pattern substitution. 
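
The matching rule described above can be sketched as a backtracking search over a statement list. This is a simplified illustration and not the Broadway pattern matcher, which the paper describes as still under development: the Stmt record, the helper names, and the convention that reads and writes hold objects already resolved by pointer analysis (so that *p appears as x) are all assumptions.

```python
from collections import namedtuple

# reads/writes hold object names after pointer analysis, so *p appears as "x".
Stmt = namedtuple("Stmt", "op reads writes")

def bind(meta_vars, objs, env):
    """Extend env so that each meta-variable maps to one object, or fail."""
    if len(meta_vars) != len(objs):
        return None
    env = dict(env)
    for m, o in zip(meta_vars, objs):
        if env.setdefault(m, o) != o:
            return None
    return env

def independent(stmt, matched):
    """True if stmt has no dependence with any matched statement."""
    for m in matched:
        if set(stmt.writes) & (set(m.reads) | set(m.writes)):
            return False
        if set(stmt.reads) & set(m.writes):
            return False
    return True

def match(pattern, code, pos=0, env=None, picked=()):
    """Return (bindings, statement indices) for the first match, else None."""
    env = {} if env is None else env
    if len(picked) == len(pattern):
        if not picked:
            return env, picked
        chosen = [code[i] for i in picked]
        between = [code[i] for i in range(picked[0], picked[-1])
                   if i not in picked]
        ok = all(independent(s, chosen) for s in between)
        return (env, picked) if ok else None
    want = pattern[len(picked)]
    for i in range(pos, len(code)):
        s = code[i]
        if s.op != want.op:
            continue
        e = bind(want.reads, s.reads, env)
        if e is not None:
            e = bind(want.writes, s.writes, e)
        if e is None:
            continue
        found = match(pattern, code, i + 1, e, picked + (i,))
        if found is not None:
            return found
    return None

PATTERN = [Stmt("sin", ("X",), ("T0",)),
           Stmt("cos", ("X",), ("T1",)),
           Stmt("div", ("T0", "T1"), ("Y",))]
CODE = [Stmt("addr", (), ("p",)),          # p = &x
        Stmt("sin", ("x",), ("t0",)),      # sin(x, &t0)
        Stmt("log", ("n",), ()),           # irrelevant intervening statement
        Stmt("cos", ("x",), ("t1",)),      # cos(*p, &t1), *p resolved to x
        Stmt("div", ("t0", "t1"), ("y",))]
```

Because the intervening log statement touches none of the objects in the matched sequence, the pattern still matches. Replacing it with a statement that overwrites t0 would make the match fail.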

Preconditions. The preconditions provide a way to test the results of the pointer anal- 
ysis and user-defined dataflow analyses, since these can’t be conveniently represented 
in the syntactic patterns. These dataflow requirements can be complicated for libraries, 
because important properties and dependences often exist between internal components 
of the data structures, rather than between the surface variables. For example, as shown
in an earlier figure, two different PLAPACK views may refer to the same underlying matrix
data. An optimization may require that a sequence of PLAPACK calls all update the
same matrix. In this case the annotations need a way to access the pointer analysis information
and make sure that the condition is satisfied. To do this, we allow the preconditions
to refer to the on_entry and on_exit annotations for the library routines in the
pattern. To access the dataflow analysis results, the preconditions can express queries 
using the same flow value operators that the transfer functions use. For example, the 
preconditions can express constraints such as, “the view of matrix A is empty.” 

Actions. When a pattern matches and the preconditions are satisfied, the compiler can 
perform the specified optimization. We have found that the most common optimizations 
for libraries consist of replacing a library call or sequence of library calls with more spe- 
cialized code. The replacement code is specified as a code template, possibly containing 
meta-variables, much like the patterns. Here, the compiler expands the embedded meta-variables,
replacing them with the actual code bound to them. We also support queries
on the meta-variables, such as the C datatype of the binding. This allows us to declare 
new variables that have the same type as existing variables. 

In addition to pattern replacement, we offer four other actions: (1) remove the 
matching code, (2) move the code elsewhere in the application, (3) insert new code, 
or (4) trigger one of the enabling transformations such as inlining or loop unrolling. 

When moving or inserting new code, the annotations support a variety of useful 
positional indicators that describe where to make the changes relative to the site of the 
matching code. For example, the earliest possible point and the latest possible point are 
defined by the dependences between the matching code and its surrounding context. 
Using these indicators, we can perform the MPI scheduling described earlier:
move the MPI_Isend to the earliest point and the MPI_Wait to the latest point. Other
positional indicators might include enclosing loop headers or footers, and the locations 
of reaching definitions or uses. Figure 9 demonstrates some of the annotations that use
pattern-based transformations to optimize the examples presented in this paper. 
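
The earliest and latest positional indicators can be illustrated on a straight-line statement list. The sketch below is a simplified model under stated assumptions, not the compiler's algorithm: statements carry explicit read and write sets, the example program and its object names are hypothetical, and MPI_Wait is modeled as reading both the request and the buffer so that it cannot move past a reuse of the buffer.

```python
from collections import namedtuple

Stmt = namedtuple("Stmt", "name reads writes")

def depends(first, second):
    """True if the two statements cannot be reordered."""
    w1, w2 = set(first.writes), set(second.writes)
    return bool(w1 & (set(second.reads) | w2) or set(first.reads) & w2)

def earliest(code, i):
    """Smallest index that statement i can be hoisted to."""
    j = i
    while j > 0 and not depends(code[j - 1], code[i]):
        j -= 1
    return j

def latest(code, i):
    """Largest index that statement i can be sunk to."""
    j = i
    while j + 1 < len(code) and not depends(code[i], code[j + 1]):
        j += 1
    return j

def move(code, i, j):
    """Return a copy of code with statement i repositioned at index j."""
    code = list(code)
    code.insert(j, code.pop(i))
    return code

program = [
    Stmt("compute1", ("a",), ("b",)),
    Stmt("fill", ("c",), ("buf",)),
    Stmt("compute2", ("b",), ("d",)),
    Stmt("MPI_Isend", ("buf",), ("req",)),
    Stmt("MPI_Wait", ("req", "buf"), ()),
    Stmt("compute3", ("d",), ("e",)),
    Stmt("reuse", ("e",), ("buf",)),
]
hoisted = move(program, 3, earliest(program, 3))
wait_at = [s.name for s in hoisted].index("MPI_Wait")
scheduled = move(hoisted, wait_at, latest(hoisted, wait_at))
```

After both moves, the send is issued immediately after the buffer is filled and the wait is delayed until just before the buffer is reused, so two computations overlap with the communication.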

5 Related Work 

Our research extends previous work in optimization, partial evaluation,
abstract interpretation, and pattern matching to libraries. This section relates our
work to other efforts that provide configurable compilation technology.

The Genesis optimizer generator produces a compiler optimization pass from a 
declarative specification of the optimization. Like Broadway, the specification uses
patterns, conditions and actions. However, Genesis targets classical loop optimizations
for parallelization, so it provides no way to define new program analyses. Conversely,
the PAG system is a completely configurable program analyzer that uses an ML-like
language to specify the flow value lattices and transfer functions. While powerful,
the specification is low-level and requires an intimate knowledge of the underlying
mathematics. It does not include support for actual optimizations. 

Some compilers provide special support for specific libraries. For example, seman- 
tic expansion has been used to optimize complex number and array libraries, essentially 
extending the language to include these libraries. Similarly, some C compilers recognize
calls to malloc() when performing pointer analysis. Our goal is to provide
configurable compiler support that can apply to many libraries, not just a favored few. 

pattern { ${obj:y} = sin(${obj:x}) / cos(${obj:x}) }
{
  replace { $y = tan($x) }
}

pattern {
  MPI_Isend( ${obj:buffer}, ${expr:dest}, ${obj:req_ptr} )
}
{
  move @earliest;
}

pattern {
  PLA_Obj_horz_split_2( ${obj:A}, ${expr:size},
                        ${obj:upper_ptr}, ${obj:lower_ptr} )
}
{
  on_entry { A --> view_1, DATA of view_1 --> data }
  when (Distribution of view_1 == Empty) remove;
  when (Distribution of view_1 == Local)
    replace {
      PLA_Obj_view_all( $A, $upper_ptr );
    }
}



Fig. 9. Example annotations for pattern-based transformations. 



Meta-programming systems such as meta-object protocols, programmable syntax
macros, and the Magik compiler can be used to create customized library
implementations, as well as to extend language semantics and syntax. While these
techniques can be quite powerful, they require users to manipulate ASTs and other compiler
internals directly and with little dataflow information.



6 Conclusions 

This paper has outlined the various challenges and possibilities for performing library- 
level optimizations. In particular, we have argued that such optimizations require pre- 
cise pointer analysis, domain-specific information, and a configurable compilation 
scheme. We have also presented an annotation language that supports such a compi- 
lation scheme. 

A large portion of our Broadway compiler has been implemented, including a flow-
and context-sensitive pointer analysis, a configurable abstract interpretation pass, and
the basic annotation language without pattern matching. Experiments with this
basic configuration have shown that significant performance improvements are possible
for applications that use the PLAPACK library. One common routine, PLA_Trsm(),
was customized to improve its performance by a factor of three, yielding speedups of
26% for a Cholesky factorization application and 9.5% for a Lyapunov program.

While we believe there is much promise for library-level optimizations, several 
open issues remain. We are in the process of defining the details of our annotation 
language extensions for pattern matching, and we are implementing its associated pattern
matcher. Finally, we need to evaluate the limits of our scheme, and of our use
of abstract interpretation and pattern matching in particular, with respect to both
optimization capabilities and ease of use.

Abstract. A Flat Neighborhood Network (FNN) is a new interconnection
network architecture that can provide very low latency and high
bisection bandwidth at a minimal cost for large clusters. However, unlike 
more traditional designs, FNNs generally are not symmetric. Thus, al- 
though an FNN by definition offers a certain base level of performance for 
random communication patterns, both the network design and commu- 
nication (routing) schedules can be optimized to make specific commu- 
nication patterns achieve significantly more than the basic performance. 
The primary mechanism for design of both the network and communi- 
cation schedules is a set of genetic search algorithms (GAs) that derive 
good designs from specifications of particular communication patterns.
This paper centers on the use of these GAs to compile the network wiring 
pattern, basic routing tables, and code for specific communication pat- 
terns that will use an optimized schedule rather than simply applying 
the basic routing. 



1 Introduction 

In order to use compiler techniques to design and schedule use of FNNs, it is first 
necessary to understand precisely what a FNN is and why such an architecture 
is beneficial. Toward that, it is useful to briefly discuss how the concept of a 
FNN arose. Throughout this paper, we will use KLAT2 (Kentucky Linux Athlon 
Testbed 2), the first FNN cluster, as an example. Though not huge, KLAT2 is 
large enough to effectively demonstrate the utility of FNNs: it unites 66 Athlon 
PCs using a FNN consisting of 264 NICs (Network Interface Cards) and 10 
switches. 

There are two reasons that the processors of a parallel computer need to be 
connected: (1) to send data between them and (2) to agree on global properties 
of the computation. As we discussed in [1], the second functionality is not well- 
served using message-passing hardware. Here, we focus on the first concern. 
Further, we will restrict our discussion to clusters of PCs, since few people will 
have the opportunity to design their own traditional supercomputer’s network. 



S.P. Midkiff et al. (Eds.): LCPC 2000, LNCS 2017, pp. 244- T?CT 2001. 
© Springer-Verlag Berlin Heidelberg 2001

Fig. 1a. Direct connections
Fig. 1b. Switchless connections
Fig. 1c. Ideal switch
Fig. 1e. Channel bonding
Fig. 1f. FNN
Fig. 1g. FNN with uplink switch
Fig. 1h. FNN with folded uplink

Fig. 1. Network topologies used in connecting cluster nodes



In the broadest terms, we need to distinguish only six different classes of 
network topologies (and two minor variations on the last). These are shown in 
Fig. 1. 

The ideal network configuration would be one in which each processor is 
directly connected to every other node, as shown in Fig. 1a. Unfortunately,
for an N-node system this would require N-1 connections for each node. Using
standard motherboards and NICs, there are only bus slots for a maximum of 
4-6 NICs. Using relatively expensive 4-interface cards, the upper bound could 
be as high as 24 connections; but even that would not be usable for a cluster 
with more than 25 nodes. 

Accepting a limit on the number of connections per node, direct connections 
between all nodes are not possible. However, it is possible to use each node as 
a switch in the network, routing through nodes for some communications. In 
general, the interconnection pattern used is equivalent to some type of hypermesh;
in Fig. 1b, a degree 2 version (a ring) is pictured. Because NICs are
generally cheaper than switches, this structure minimizes cost, but it also yields 
very large routing delays - very high latency. 

To minimize latency without resorting to direct connections, the ideal net- 
work would connect a high-bandwidth NIC in each node to a single wire-speed 
switch, as shown in Fig. 1c. For example, using any of the various Gb/s network
technologies (Gigabit Ethernet [2], Myricom’s Myrinet [3], Giganet’s CLAN [4], 
Dolphin’s SCI [5]), it is now possible to build such a network. Unfortunately, the 
cost of a single Gb/s NIC exceeds the cost of a typical node, the switch is even 
more expensive per port, and wide switches are not available at all. Most Gb/s 
switches are reasonably cheap for 4 ports, expensive for 8 or 16 ports, and only 
a few are available with as many as 64 ports. Thus, this topology works only for 
small clusters. 

The closest scalable approximation to the single switch solution substitutes 
a hierarchical switching fabric for the single switch, as shown in Fig. 1d. Some
Gb/s technologies allow more flexibility than others in selecting the fabric’s internal
topology; for example, Gb/s Ethernet only supports a simple tree whereas
Giganet CLAN can use higher-performance topologies such as fat trees - at the 
expense of additional switches. However, any switching fabric will have a higher 
latency than a single switch. Further, the bisection bandwidth of the entire sys- 
tem is limited to the lesser of the bisection bandwidth of the switch(es) at the 
top of the tree or the total bandwidth of the links to the top switch(es). This 
is problematic for Gb/s technologies because the “uplinks” that interconnect
switches within the fabric are generally the same speed as the connections used 
for the NICs; thus, half of the ports on each switch must be used for uplinks to 
achieve the maximum bisection bandwidth. 
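
The bandwidth limit described above reduces to simple arithmetic, sketched here. The function names are hypothetical, bandwidths are in arbitrary but consistent units, and the second function assumes that uplinks occupy ordinary switch ports, which is the Gb/s case discussed in the text.

```python
def tree_bisection(top_switch_bisection, uplink_count, uplink_speed):
    # The lesser of the top switch capacity and the total uplink bandwidth.
    return min(top_switch_bisection, uplink_count * uplink_speed)

def node_ports_for_full_bisection(ports, link_speed, uplink_speed):
    """Largest number n of node ports on one switch such that the remaining
    ports, used as uplinks, can carry all node traffic:
    n * link_speed <= (ports - n) * uplink_speed."""
    return (ports * uplink_speed) // (link_speed + uplink_speed)
```

For a 16-port switch whose uplinks run at the same speed as the node links, node_ports_for_full_bisection(16, 1, 1) leaves 8 ports for nodes, which is the "half of the ports" observation in the text.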

Fortunately, 100Mb/s Ethernet switches do not share this last problem: wire-speed
100Mb/s switches often have Gb/s uplinks. Thus, it is possible to build
significantly wider switch fabrics that preserve bisection bandwidth at a relatively
low cost. The problem is that a single 100Mb/s NIC per node does not
provide enough bandwidth for many applications. Fig. 1e shows the standard
Linux-supported solution: use multiple NICs per node, connect them to identical
fabrics, and treat the set of NICs in each node as a single, parallel, NIC. The
software support for this, commonly known as “channel bonding,” was the primary
technical contribution of the original Beowulf project. Unfortunately, the
switch fabric latency is still high and building very large clusters this way yields
the same bisection bandwidth and cost problems discussed for Gb/s systems
built as shown in Fig. 1d. Further, because channel-bonded NICs are treated as
a single wide channel, the ability to send to different places by simultaneously 
using different NICs is not fully utilized. 

The Flat Neighborhood Network (FNN), shown in Fig. 1f, solves all these
problems. Because switches are connected only to NICs, not to other switches, 
single switch latency and relatively high bisection bandwidths are achieved. Cost 
also is significantly lower. However, FNNs do cause two problems. The first is 
that some pairs of nodes only have single-NIC bandwidth between them with 
the minimum latency, although extra bandwidth can be obtained with a higher 
latency by routing through nodes to “hop” neighborhoods. The second problem
is that routing becomes a very complex issue. 

For example, the first two machines in Fig. 1f have two neighborhoods (subnets)
in common, so communication between them can be done much as it would
be for channel bonding. However, that bonding of the first machine’s NICs would 
not work when sending a message to the third machine because those nodes share 
only one neighborhood. Even without the equivalent of channel bonding, rout- 
ing is complicated by the fact that the apparent address (NIC) for the third 
machine is different depending on which node is sending to it; the first machine 
would talk to the first NIC in the third node, but the last machine would talk to 
the second NIC. Further, although this very small example has some symmetry 
which could be used to simplify the specification of routing rules, that is not 
generally true of FNNs. 

At this point, it is useful to abstract the fully general definition of a FNN: a 
network using a topology in which all important (usually, but not necessarily, all) 
point-to-point communication paths are implemented with only a single switch 
latency. In practice, it is convenient to augment the FNN with an additional 
switch that connects to the uplinks from the FNN’s switches, since that switch 
can provide more efficient multicast support and I/O with external systems (e.g., 
workstations or other clusters). This second-level switch also can be a convenient 
location for “hot spare” nodes to be connected. The FNN with this additional
uplink switch is shown in Fig. 1g.
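
The defining property of an FNN, and the sender-dependent addressing discussed above, can be checked with a few lines of code. This sketch uses a hypothetical four-node wiring chosen to be consistent with the prose description (it is not necessarily the wiring drawn in Fig. 1f), and it treats every node pair as important; the function names are likewise assumptions.

```python
from itertools import combinations

# Hypothetical wiring: node -> switch reached by each of its NICs, in NIC order.
WIRING = {0: ["S0", "S1"],
          1: ["S0", "S1"],
          2: ["S0", "S2"],
          3: ["S1", "S2"]}

def shared(wiring, a, b):
    """Neighborhoods (switches) that nodes a and b both connect to."""
    return sorted(set(wiring[a]) & set(wiring[b]))

def is_flat(wiring):
    """True if every pair of nodes can communicate through a single switch."""
    return all(shared(wiring, a, b) for a, b in combinations(wiring, 2))

def route(wiring, src, dst):
    """Return (source NIC index, destination NIC index) for a single-switch
    path, or None if the two nodes share no neighborhood."""
    common = shared(wiring, src, dst)
    if not common:
        return None
    switch = common[0]
    return wiring[src].index(switch), wiring[dst].index(switch)
```

In this wiring, nodes 0 and 1 share two neighborhoods, so both NICs can be used between them, while node 2 is reached through its first NIC from node 0 but through its second NIC from node 3.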

In the special case that one of the FNN switches has sufficient ports available, 
it also is possible to “fold” the uplink switch into one of the FNN switches. This
folded uplink FNN configuration is shown in Fig. 1h. Although the example’s 4-port
switches would not be wide enough to be connected as shown in this figure,
if the switches are wide enough, it always is possible to design the network so 
that sufficient ports are reserved on one of the FNN switches. 

Thus, FNNs scale well, easily provide multicast and external I/O, and offer 
high performance at low cost. A more detailed evaluation of the performance 
of FNNs (especially in comparison to fat trees and channel bonding) is given 
in [9]. Independent of and concurrent with our work, a group at the Australian 
National University created a cluster (Bunyip [10]) whose network happens to
have the FNN properties, and their work confirms the performance benefits. 

The problem with FNNs is that they require clever routing. Further, their
performance can be improved by tuning the placement of the paths with extra 
bandwidth so that they correspond to the communication patterns that are 
most important for typical applications. In other words, FNNs require compiler 
technology for analysis and scheduling (routing) of communication patterns in 
order to achieve their full performance. Making full use of this technology for 
both the design and use of FNNs yields significant benefits. 

Bunyip [10] uses a hand-designed symmetric FNN. However, the complexity 
of the FNN design problem explodes when a system is being designed with a set 
of optimization criteria. Optimization criteria range from information about rel- 
ative importance of various communication patterns to node physical placement 
cost functions (intended to reduce physical wiring complexity). Further, many of 
these criteria interact in ponderous ways that only can be evaluated by partial 
simulation of potential designs. It is for these reasons that our FNN design tools 
are based on genetic search algorithms (GAs). 

2 The FNN Compiler 

The first step in creating a FNN system is the design of the physical network. 
Logically, the design of the network is a function of two separate sets of con- 
straints: the constraints imposed by physical hardware and those derived from 
analysis of the communications that the resulting FNN is to perform. Thus, the 
compiler’s task is to parse specifications of these constraints, construct and exe- 
cute a GA that can optimize the design according to these constraints, and finally 
to encode the resulting design in a form that facilitates its physical construction 
and use. 

The current version of our network compiler uses: 

— A specification of how many PCs, the maximum number of NICs per PC (not all
PCs have to have the same number of NICs!), and a list of available
switches specified by their width (number of ports available per switch).
Additional dummy NICs and/or switches are automatically created within
the program to allow uneven use of real NICs/switch ports. For example,
KLAT2’s current network uses only 8 of 31 ports on one of its switches; the
other switch ports appear to be occupied by dummy NICs that were created
by the program.

— A designer-supplied evaluation function that returns a quality value derived 
by analysis of specific communication patterns and other performance mea- 
sures. This function also marks problem spots in the proposed network con- 
figuration so that they can be preferentially changed in the GA process. 

In the near future, we expect to distribute a version of the compiler which 
has been enhanced to additionally include: 

— A list of switch and NIC hardware costs, so that the selection of switches
and NIC counts also can be automatically optimized.

— A clean language interface for this specification.
Currently, we modify the GA-based compiler itself by including C functions
that redefine the search parameters. 



2.1 The GA Structure 

The GA is not a generic GA, but is highly specialized to the problem of designing 
the network. The primary data structure is a table of bitmasks for each PC; 
each PC’s bitmask has a 1 only in positions corresponding to each neighborhood 
(switch) to which that PC has a NIC connected. This data structure does not 
allow a PC to have multiple NICs connected to the same neighborhood; however,
such a configuration would add nothing to the FNN connectivity. Enforcing this 
constraint and the maximum number of NICs greatly narrows the search space. 
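
The bitmask table described above can be illustrated with a minimal sketch. It is written in Python for brevity rather than in the C of the actual tool, and the function names and the four-PC example design are hypothetical, not taken from the compiler itself.

```python
# Sketch only: one integer bitmask per PC, where bit s is 1 when the PC
# has a NIC on switch (neighborhood) s. Names and data are hypothetical.

def make_masks(switch_members, num_pcs):
    """Build the per-PC bitmask table from per-switch lists of PC numbers."""
    masks = [0] * num_pcs
    for switch, pcs in enumerate(switch_members):
        for pc in pcs:
            # A bit is either set or clear, so two NICs of one PC on the
            # same switch cannot be represented, as the text notes.
            masks[pc] |= 1 << switch
    return masks

def nic_count(mask):
    """The number of NICs a PC uses is the number of 1 bits in its mask."""
    return bin(mask).count("1")

# Hypothetical design: 4 PCs with 2 NICs each on 3 switches.
masks = make_masks([[0, 1, 2], [0, 1, 3], [2, 3]], num_pcs=4)
within_nic_limit = all(nic_count(m) <= 2 for m in masks)
```

Checking the NIC limit is then a population count per table entry, which is one reason this representation keeps the search space easy to constrain.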

Written in C, the GA’s bitmask data structure facilitates use of SIMD-within- 
a-register parallelism [6] when executed on a single processor. It also can be 
executed in parallel using a cluster. KLAT2’s current network design was actually 
created using our first Athlon cluster, Odie - four 600MHz Athlon PCs. 

To more quickly converge on a good solution, the GA is applied in two distinct 
phases. Large network design problems with complex evaluation functions are 
first converted into smaller problems to be solved for a simplified evaluation 
function. This rephrased problem often can be solved very quickly and then 
scaled up, yielding a set of initial configurations that will make the full search 
converge faster. 

The simplified cost weighting only values basic FNN connectivity, making 
each PC directly reachable from every other. The problem is made smaller by 
dividing both the PC count and the switch port counts by the same number 
while keeping the NICs per PC unchanged. For example, a design problem using 
24-port switches and 48 PCs is first scaled to 2-port switches and 4 PCs; if no 
solution is found within the allotted time, then 3-port switches and 6 PCs are
tried, then 4-port switches and 8 PCs, etc. A number of generations after finding 
a solution to one of the simplified network design problems, the population of 
network designs is scaled back to the original problem size, and the GA resumes 
using the designer-specified evaluation function. 

If no solution is found for any of the scaled-down problems, the GA is directly 
applied to the full-size problem. 
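
One way to generate the sequence of scaled-down problems is sketched below. It assumes that all switches have the same width and that the scale factors are the common divisors of the port count and PC count, tried from largest to smallest; the text gives only the first three steps of the example, so the continuation is an assumption, and `scaled_problems` is a hypothetical name.

```python
from math import gcd

def scaled_problems(ports, pcs):
    """Yield (ports, pcs) for scaled-down problems, smallest problem first.

    Sketch only: divides both counts by each common divisor, skipping
    sizes that would leave switches with fewer than 2 ports.
    """
    g = gcd(ports, pcs)
    for d in range(g, 1, -1):
        if g % d == 0 and ports // d >= 2:
            yield ports // d, pcs // d

# The example from the text: 24-port switches and 48 PCs.
schedule = list(scaled_problems(24, 48))
```

For this example the schedule begins with 2-port switches and 4 PCs, then 3-port switches and 6 PCs, then 4-port switches and 8 PCs, matching the order given above.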



2.2 The Genetic Algorithm Itself 

The initial population for the GA is constructed for the scaled-down problem 
using a very straightforward process in which each PC’s NICs are connected to 
the lowest-numbered switch that still has ports available and is not connected 
to the same PC via another NIC. Additional dummy switches are created if 
the process runs out of switch ports; similarly, dummy NICs are assigned to 
virtual PCs to absorb any unused real switch ports. The resulting scaled-down 
initial FNN design satisfies all the constraints except PC-to-PC connectivity. 
Because the full-size GA search typically begins with a population created from 
a scaled-down population, it also satisfies all the basic design constraints except
connectivity. 

By making all the GA transformations preserve these properties, the evalu- 
ation process checks only connectivity, not switch port usage, NIC usage, etc. 
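
The construction of an initial design can be sketched as follows. This is an illustrative Python version under stated assumptions: dummy switches are given the width of the widest real switch, at least one real switch is supplied, and the dummy NICs that absorb unused ports are omitted. The function name is hypothetical.

```python
def initial_design(num_pcs, nics_per_pc, switch_widths):
    """Connect each NIC to the lowest-numbered switch that has a free port
    and is not already connected to the same PC. Returns (masks, free)."""
    free = list(switch_widths)          # free ports remaining per switch
    masks = []
    for _pc in range(num_pcs):
        mask = 0
        for _nic in range(nics_per_pc):
            for s in range(len(free)):
                if free[s] > 0 and not mask & (1 << s):
                    break               # found a usable switch
            else:
                # Ran out of usable ports: create a dummy switch
                # (assumed here to be as wide as the widest real one).
                free.append(max(switch_widths))
                s = len(free) - 1
            free[s] -= 1
            mask |= 1 << s
        masks.append(mask)
    return masks, free

# 4 PCs with 2 NICs each on four 2-port switches.
masks, free = initial_design(4, 2, [2, 2, 2, 2])
# 3 PCs on only two 2-port switches forces dummy switches to appear.
masks_dummy, free_dummy = initial_design(3, 2, [2, 2])
```

In the first example PCs 0 and 1 end up on switches 0 and 1 while PCs 2 and 3 end up on switches 2 and 3, so the port and NIC constraints hold but PC-to-PC connectivity does not, which is exactly the property left for the GA to repair.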

The GA’s generation loop begins by evaluating all new members of a popu- 
lation of potential FNN designs. Determining which switches are shared by two 
PCs is a simple matter of bitwise AND of the two bitmasks; counting the ones 
in that result measures the available bandwidth. Which higher-level evaluation 
function is used depends on whether the problem has been scaled down. The
complete population is then sorted in order of decreasing fitness, so that the 
top KEEP entries will be used to build the next generation’s population. In order 
to ensure some genetic variety, the last FUDGE FNN designs that would be kept 
intact are randomly exchanged with others that would not have been kept. If a 
new FNN design is the best fit, it is reported. 
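
The evaluation and selection steps of the generation loop can be sketched as below. This is not the authors' code: `KEEP` and `FUDGE` appear as the parameters `keep` and `fudge`, the fitness shown is only the simplified "basic connectivity" measure, and all function names are hypothetical.

```python
import random

def shared_links(mask_a, mask_b):
    """Bitwise AND finds the shared switches; counting the ones in the
    result gives the number of links (bandwidth) between two PCs."""
    return bin(mask_a & mask_b).count("1")

def connected_pairs(masks):
    """Simplified fitness: number of PC pairs sharing at least one switch."""
    n = len(masks)
    return sum(1 for a in range(n) for b in range(a + 1, n)
               if masks[a] & masks[b])

def select_survivors(population, fitness, keep, fudge, rng):
    """Sort by decreasing fitness, keep the top `keep` designs, then swap
    the last `fudge` kept designs with randomly chosen rejected ones."""
    ranked = sorted(population, key=fitness, reverse=True)
    keep = min(keep, len(ranked))
    kept, rest = ranked[:keep], ranked[keep:]
    for i in range(max(0, keep - fudge), keep):
        if not rest:
            break
        j = rng.randrange(len(rest))
        kept[i], rest[j] = rest[j], kept[i]
    return kept

# Three hypothetical designs for 4 PCs, each a list of per-PC bitmasks.
d1, d2, d3 = [3, 3, 5, 6], [1, 2, 4, 4], [3, 3, 4, 4]
kept = select_survivors([d2, d3, d1], connected_pairs, keep=2, fudge=1,
                        rng=random.Random(0))
```

Here `d1` connects every pair and is always kept first, while the single fudged slot is handed to a design that sorting alone would have discarded.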

Aside from the GA using different evaluation functions for the full size and 
scaled-down problems, there are also different stopping conditions applied at 
this point in the GA. Since we cannot know what the precise optimum design’s 
value will be for full-size search, it terminates only when the maximum number 
of generations has elapsed. In contrast, the scaled-down search will terminate in 
fewer generations if a FNN design with the basic connectivity is found earlier in 
the search. 

Crossover is then used to synthesize CROSS new FNN designs by combining 
aspects of pairs of parent FNN designs that were marked to be kept. The proce- 
dure used begins by randomly selecting two different parent FNN designs, one 
of which is copied as the starting design for the child. This child then has a 
random number of substitutions made, one at a time, by randomly picking a PC 
and making its set of NIC connections match those for that PC in the other par- 
ent. This forced match process works by exchanging NIC connections with other 
PCs (which may be real or dummy PCs) in the child that had the desired NIC 
connections. Thus, the resulting child has properties taken from both parents, 
yet always is a complete specification of the NIC to switch mapping. In other 
words, crossover is based on exchange of closed sets of connections, so the new 
configuration always satisfies the designer-specified constraints on the number 
of NICs/PC and the number of ports for each switch. 
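
A rough sketch of the forced-match crossover on the bitmask representation is given below. It assumes that every PC (real or dummy) has the same number of NICs and that both parents use the same number of ports on each switch. The donor search shown is one plausible reading of "exchanging NIC connections with other PCs", not the published procedure, and a match is simply left partial when no donor exists.

```python
import random

def bits(mask):
    """List the switch numbers whose bits are set in mask."""
    return [s for s in range(mask.bit_length()) if mask >> s & 1]

def force_match(child, pc, target):
    """Make child[pc] equal target by trading single connections with
    donor PCs, so per-PC NIC counts and per-switch port usage are kept."""
    for s in bits(target & ~child[pc]):          # switches pc must gain
        for t in bits(child[pc] & ~target):      # switches pc can give up
            donor = next((q for q in range(len(child))
                          if q != pc
                          and child[q] >> s & 1
                          and not child[q] >> t & 1), None)
            if donor is not None:
                flip = (1 << s) | (1 << t)
                child[pc] ^= flip                # pc: gains s, loses t
                child[donor] ^= flip             # donor: gains t, loses s
                break

def crossover(parent_a, parent_b, rng):
    """Copy one parent, then force a random number of PCs to match the
    other parent's connections for that PC."""
    child = list(parent_a)
    for _ in range(rng.randint(1, len(child))):
        pc = rng.randrange(len(child))
        force_match(child, pc, parent_b[pc])
    return child

step = [3, 3, 12, 12]
force_match(step, 0, 0b0101)                     # PC 0 should use switches 0 and 2
kid = crossover([3, 3, 12, 12], [5, 10, 5, 10], random.Random(1))
kid_usage = [sum(m >> s & 1 for m in kid) for s in range(4)]
```

Because every trade flips the same two bits in two different PCs, the child always remains a complete NIC-to-switch mapping that respects the port and NIC limits.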

Mutation is used to create the remainder of the new population from the 
kept and crossover designs. Two different types of mutation operation are used,
both applied a random number of times to create each mutated FNN design: 

1. The first mutation technique swaps individual NIC-to-switch connections 
between PCs selected at random. 

2. The second mutation technique simply swaps the connections of one PC with 
those of another PC, essentially exchanging PC numbers. 

Thus, the mutation operators are also closed and preserve the basic NIC and 
switch port design constraints. The generation process is then repeated with a 
population consisting of the kept designs from the previous generation, crossover 
products, and mutated designs. 
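
The two mutation operators can be sketched on the bitmask representation as follows. The function names are hypothetical, and the way the first operator picks which connections to trade is an assumption chosen so that the operator stays closed.

```python
import random

def bits(mask):
    """List the switch numbers whose bits are set in mask."""
    return [s for s in range(mask.bit_length()) if mask >> s & 1]

def swap_one_connection(masks, rng):
    """Mutation 1: two random PCs trade one NIC-to-switch connection each."""
    a, b = rng.sample(range(len(masks)), 2)
    only_a = bits(masks[a] & ~masks[b])   # switches a has and b lacks
    only_b = bits(masks[b] & ~masks[a])   # switches b has and a lacks
    if only_a and only_b:
        flip = (1 << rng.choice(only_a)) | (1 << rng.choice(only_b))
        masks[a] ^= flip
        masks[b] ^= flip

def swap_pcs(masks, rng):
    """Mutation 2: two random PCs exchange all connections, which is the
    same as exchanging their PC numbers."""
    a, b = rng.sample(range(len(masks)), 2)
    masks[a], masks[b] = masks[b], masks[a]

def port_usage(masks, num_switches):
    """Number of ports in use on each switch."""
    return [sum(m >> s & 1 for m in masks) for s in range(num_switches)]

design = [3, 3, 12, 12]                   # hypothetical 4-PC design
rng = random.Random(0)
for _ in range(50):
    swap_one_connection(design, rng)
    swap_pcs(design, rng)
usage_after = port_usage(design, 4)
nics_after = [bin(m).count("1") for m in design]
```

After any number of mutations each PC still has two NICs and each switch still has two ports in use, so only connectivity needs to be re-evaluated.
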

2.3 The FNN Compiler’s Output

The output of the FNN compiler is simply a table. Each line begins with a 
switch number followed by a colon (:), which is then followed by the list of PC numbers
connected to that switch. 

This list is given in sorted order, but for ideal switches, it makes no difference 
which PCs are connected to which ports, provided that the ports are on the 
correct switch. It also makes very little difference which NICs within a PC are 
connected to which switch. However, to construct routing tables, it is necessary 
to know which NICs are connected to each switch, so we find it convenient to 
also order the NICs such that, within each PC, the lowest-numbered NIC is 
connected to the lowest-numbered switch, etc. 

We use this simple text table as the input to all our other tools. Thus, the 
table could be edited, or even created, by hand. 
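
A reader for this table might look like the sketch below. The exact file syntax is not reproduced in this paper, so whitespace-separated PC numbers are an assumption, as are the function names and the sample text.

```python
def parse_design(text):
    """Parse lines of the form 'switch: pc pc pc ...' into a dictionary
    mapping each switch number to its list of PC numbers."""
    switches = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        label, _, members = line.partition(":")
        switches[int(label)] = [int(tok) for tok in members.split()]
    return switches

def nic_assignment(switches):
    """For each PC, list its switches in increasing order, so that the
    lowest-numbered NIC corresponds to the lowest-numbered switch."""
    per_pc = {}
    for s in sorted(switches):
        for pc in switches[s]:
            per_pc.setdefault(pc, []).append(s)
    return per_pc

sample = "0: 0 1 2\n1: 0 1 3\n2: 2 3\n"      # hypothetical 4-PC design
switches = parse_design(sample)
per_pc = nic_assignment(switches)
```

The second function applies the NIC ordering convention described above, which is what the routing translators need in order to name a specific NIC.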

3 The FNN Translators 

Once the FNN compiler has created the network design, there are a variety 
of forms that this design must be translated into in order to create a working 
implementation. For this purpose, we have created a series of translators. 



3.1 Physical Wiring 

One of the worst features of FNNs is that they are physically difficult to wire. 
This is because, by design, they are irregular and often have very poor physical 
locality between switches and NICs. Despite this, wiring KLAT2’s PCs with 4 
NICs each took less than a minute per cable, including the time to neatly route 
the cables between the PC and the switches. 

The trick that allowed us to wire the system so quickly is nothing more 
than color-coding of the switches and NICs. As described above, all the ports 
on a switch can be considered interchangeable; it doesn’t matter which switch 
port a NIC is plugged into. Category 5 cable, the standard for Fast Ethernet, is 
available in dozens of colors at no extra cost. Thus, the problem is simply how 
to label each PC with the appropriate colors for the NICs it contains.

For this purpose, we created a simple program that translates the FNN switch 
connection representation into an HTML file. This file, which can be loaded into 
any WWW browser and printed, contains a set of per-PC color-coded labels 
that have a color patch for each NIC in the PC showing which color cable, and 
hence which switch, should be connected. KLAT2’s wiring, and the labels that 
were used to guide the physical process, are shown in Fig. 2. 
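
The label generator can be approximated in a few lines. This sketch is not the authors' translator: the color list, the table layout, and the names `COLORS` and `labels_html` are all hypothetical, and the input is a mapping from each PC to its switches in NIC order.

```python
# Hypothetical cable color for each switch number.
COLORS = ["blue", "orange", "green", "red"]

def labels_html(per_pc, colors):
    """Return an HTML table with one row per PC and one colored patch per
    NIC, where the patch color names the switch (cable color) to use."""
    rows = []
    for pc in sorted(per_pc):
        patches = "".join(
            f'<td style="background:{colors[s]}">NIC {i + 1}</td>'
            for i, s in enumerate(per_pc[pc]))
        rows.append(f"<tr><th>PC {pc}</th>{patches}</tr>")
    return "<table>\n" + "\n".join(rows) + "\n</table>"

html = labels_html({0: [0, 1], 1: [0, 2]}, COLORS)
```

Printing the resulting page and cutting out each row yields one label per PC, which can be placed next to the NICs on the back of the case.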

For KLAT2, it happens that half of our cables were transparent colors; the 
transparent colors are distinguished from the solid colors by use of a double 
triangle. Of course, a monochrome copy of this paper makes it difficult to identify 
specific colors, but the color-coding of the wires is obvious when the color-coded 
labels are placed next to the NICs on the back of each PC case, as can be seen
in the photo in Fig. 2.

[Figure: KLAT2’s flat neighborhood network. Above: physical wiring. Right: neighborhood pattern.]

Fig. 2. FNN wiring of KLAT2’s 64 PCs with 4 NICs each



3.2 Basic Routing Tables 

In the days of the Connection Machines (CMs), Thinking Machines employees 
often could be heard repeating the mantra, “all the wires all the time.” The 
same focus applies to FNN designs: there is tremendous bandwidth available, 
but only when all the NICs are kept busy. There are two ways to keep all the 
wires busy. One way is to have single messages divided into pieces sent by all 
the NICs within a PC, as is done using channel bonding. The other way is to 
have transmission of several messages to different destinations overlap, with one 
message per NIC. Because FNNs generally do not have sufficient connectivity to 
keep all the wires busy using the first approach, the basic FNN routing centers 
on efficient use of the second. 

Although IP routing is normally an automatic procedure, the usual means 
by which it is automated do not work well using a FNN. Sending out broadcast 
requests to find addresses is an exceedingly bad way to use a FNN, especially if 
an uplink switch is used, because that will make all NICs appear to be connected 
rather than just the ones that share subnets. Worse still, some software systems, 
such as LAM MPI [7, 8], try to avoid the broadcasts by determining the locations
of PCs once and then passing these addresses to all PCs. That approach fails 
because each PC actually has several addresses (one per NIC) and the proper 
one to use depends on which PC is communicating with it. For example, in Fig.
1f, the first PC would talk to the third PC via its address on subnet 1, but the
last PC would talk to it via the address on subnet 3. Thus, we need to construct 
a unique routing table for each PC. 

To construct these routing tables, we must essentially select one path between
each pair of PCs. According to the user-specified communication patterns, some 
PC-to-PC paths are more important than others. Thus, the assignments are 
made in order of decreasing path importance. However, the number of alternative 
paths between PCs also varies, so among paths of equal importance, we assign 
the paths with the fewest alternatives first. 

For the PC pairs that have only a single neighborhood in common, the selec- 
tion of the path is trivial. Once that has been done, the translator then examines 
PC pairs with two neighborhoods in common, and tries to select the path
whose NICs have thus far appeared in the fewest assigned paths. The process 
then continues to assign paths for pairs with three, then four, etc., neighborhoods 
in common. The complete process is then repeated for the next most important 
pairs, and so forth, until every pair has been assigned a path. 
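
The assignment order described above can be sketched as a greedy procedure. This is an illustration rather than the translator itself: the `importance` mapping stands in for the user-specified communication patterns, and measuring NIC load as the sum of paths already assigned to the two endpoint NICs is an assumption.

```python
from itertools import combinations

def shared_switches(masks, a, b):
    """Switch numbers that PCs a and b have in common."""
    m = masks[a] & masks[b]
    return [s for s in range(m.bit_length()) if m >> s & 1]

def assign_paths(masks, importance=None):
    """Pick one switch per PC pair: most important pairs first, then pairs
    with the fewest alternatives, preferring the least-used NICs."""
    importance = importance or {}
    load = {}                       # (pc, switch) -> paths using that NIC
    routes = {}

    def order(pair):
        return (-importance.get(pair, 0),
                len(shared_switches(masks, *pair)))

    for a, b in sorted(combinations(range(len(masks)), 2), key=order):
        options = shared_switches(masks, a, b)
        if not options:
            continue                # no single-switch path exists
        best = min(options,
                   key=lambda s: load.get((a, s), 0) + load.get((b, s), 0))
        routes[(a, b)] = best
        for pc in (a, b):
            load[(pc, best)] = load.get((pc, best), 0) + 1
    return routes

routes = assign_paths([3, 3, 5, 6])     # hypothetical 4-PC design
```

In this small example only PCs 0 and 1 share two switches, so every other pair is assigned its single shared switch first and the two-choice pair is decided last.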

KLAT2’s current network was designed to partially optimize row and column
communication in an 8x8 logical configuration of the 64 processors (the two hot
spares are on the uplink switch). Although the translator software actually builds
a shell script that, when executed, builds the complete set of host routing tables 
(actually, pre-loading of each ARP cache), that output is too large to include in 
this paper. A shorter version is simply a table that indicates which subnets are 
used for each pairwise communication, as shown in Fig. 3. 



0: -549459455955599959949994444544445455559559555999444449411111119
1: 5-97457495977597959579974447544457475559959755999444744757477549
2: 49-4924993933299923999494444242243333349232333399442324422222242 
3: 974-594757957597959955774744544447455559979755999494744957477549
4: 4495-97757975597974799975744544455477599559755994494449457477549 
5: 55299-6655255566559999999266226665655599552556969292229922222666
6: 974476-697977669964999764464644467477696979776699494447667477649 
7: 4497766-97977667969499974446644467477696979776994494464467477666
8: 59955599-5285289259559999328522288888358859585985292229928222289 
9: 553775775-357387573755735838788385855588833863835337337757388588 
10: 9999929923-38299222992932828282388883383938333882333239928383282 
11: 57357577853-7537573575735837788385835888853583835733337758387588 
12: 573755775787-333573555775738788385853388553388835733337757378588 
13: 5525556623253-66552555632236562653333353233333662223266662222266 
14: 99999666839336-6669996999838688383383383838336339393366968383666 
15: 979776979797366-979776969766766363333393933333699397369667377669 
16: 9999959925255569-29552662266566655655696952555662292229962222669 
17: 55257566572775672-2755662267726255677556572776662227766622222562 
18: 993949499323329992-442939424224243433393239333394394424422422242 
19: 9599799457955597574-99475744744457457559579775995494444957477549 
20: 47959999559755975549-9479747544457477549579755995444449457477549 
21: 999599999525556625299-669266226665655696552555665292229952222266 
22: 9947997997977699669446-99464646467677649979776664444447667477649 
23: 97977967933373966637769-9767666663337399973333939397366767377666 
24: 444459449525529922959999-224542445455559252555994242449952222249
25: 4447724488887287224772472-48748287877848272778884427444422288282 
26: 44444664232333366624466624-6264363633643233336362434424422422262 
27: 474446468887868667447647436-684448488686878876664747747668473646 
28: 5525526657277567572752665726-26257675656552576662727226667272662 
29: 44444244288886866224424644682-2668688886282886682222446628282266
30: 442446442828828666444666284462-648688646882886884242226622488282 
31: 4424464423333633622446464234266-63333643232333662342264422422662 
32: 45445666838885865545566643645646-5485646588358864444464468488586 
33: 573755778585533355377573573878835-377583888588885337337758377588 
34: 4434464488888333664446634864666343-38646883388334343446668388646 
35: 57357577858353335735757357387883873-7553888538885733737757337588 
36: 553575778535333357377577573858835787-358858583885737337758388588 
37: 5535556685388383653556635866686665653-88588388335333336668388568 
38: 55459999588885399595494954485844484553-9983555884444449958483549 
39: 999999663838833366399699933666636363339-883836389393366968338688 
40: 5929559933985239952555992228523258888598-52358999222229922228288 
41: 55375577533553335737757757375883888858885-3775835337337758377588 
42: 992992999383338322999293223822228838838323-388992393239928222288 
43: 5737757758358333573775735738588385855358873-55385733737757388588 
44: 55355577853883335737557357377883588388535785-3335737737757387588 
45: 553556665333836356355563586666638888385685853-665333336658388568 
46: 9939996993888636663996699836668688388383989336-69333339968333666 
47: 99999699838336396699966398666886683883889393366-9393336668388688 
48: 999949945525529922455549442422424545554995255599-224249952222249 
49: 4444424423377233223442432447722343377343233773332-33424722222242
50: 44499999933332999299494942342244434333492393333923-2324422222242 
51: 442442442733333727444247274772224733734327337333432-237422427242 
52: 4737424423233233274442434447242243473343232773332432-37722272242 
53: 44244246233336662624424644242426634333462333333342233-6622322646 
54: 944499749797766996449976944766644767769697977696944776-667477669
55: 4749496497977696964949679446666447677699979776969744766-67477649 
56: 15255266252556666225556652266222656556562525556652222266-1122566 
57: 172772773788728722277277222878228887888828877888222222771-177282 
58: 1424424423333233224442432244224443333343232333332224234411-21242 
59: 17277277288872872227727728277882878888882728888822227277272-7118
60: 172772772887828722277277282822828787888887287888222722772717-188 
61: 1525566625255266652552662226622655655556252555662222266652211-66 
62: 14444646888886666644464648646686884886488888866844444464684186-6 
63: 992996969828866992299696922626226868889888888868922226996228866- 



Fig. 3. Basic FNN routing for KLAT2 

3.3 Advanced Routing Tables

As discussed above, in a typical FNN, many pairs of PCs will share multiple 
neighborhoods. Thus, additional bandwidth can be achieved for a single mes- 
sage communication by breaking the message into chunks that can be sent via 
different paths and sending the data over all available paths simultaneously - 
the FNN equivalent of channel bonding. What makes FNN advanced routing 
difficult is that, unlike conventional channel bonding, the FNN mechanism must 
be able to correctly manage the fact that NICs are bonded only for a specific 
message destination rather than for all messages. 

For example, in Fig. 2, PC k00 is connected to the blue, transparent purple,
transparent blue, and transparent magenta neighborhoods. The second PC, k01,
shares three of those neighborhoods, replacing the transparent magenta with
orange. The third PC, k02, has only two neighborhoods in common with k00:
blue and transparent blue. Thus, when k00 sends to k01, three of its NICs can
be used to create one wider data path, but when sending from k00 to k02, only
two NICs can be used together. If k00 needs to send a message to k63, there is
only one neighborhood in common and only one NIC can be used.

Although sending message chunks through different paths is not trivial, the 
good news is that the selection of paths can be done locally (within each PC) 
without loss of optimality for any permutation communication. By definition, 
any communication pattern that is a permutation has only one PC sending to 
any particular PC. Because there is no other sender targeting the same PC, 
and all paths are implemented directly through wire-speed switches, there is 
no possibility of encountering interference from another PC’s communication. 
Further, nearly all Fast Ethernet NICs are able to send data at the same time 
that they are receiving data, so there is no interference within the NIC from 
other messages being sent out. Of course, there may be some memory access 
interference within the PCs, but that is relatively unimportant. 

A simple translator can encode the FNN topology so that a runtime pro- 
cedure can determine which NICs to specify as the sources and destinations. 
This is done by translating the switch neighborhood definitions into a table of 
NIC tuples. Each tuple specifies the NIC numbers in the destination PC that 
correspond to each of the NICs in the source PC. For example, the routing from 
k00 to k01 would be represented by a tuple of 1-3-4-0, meaning that k00’s first
NIC is routed to k01’s first NIC, k00’s second NIC is routed to the third NIC
of k01, the third NIC of k00 is routed to the fourth NIC of k01, and the final
value of 0 means that the fourth NIC of k00 is not used.

To improve caching and simplify lookup, each of the NIC tuples is encoded 
as a single integer, with a set of macros to extract the individual NIC numbers
from that integer. Extraction of a field is a shift followed by a bitwise AND. 
With this encoding, the complete advanced routing table for a node in KLAT2 
is just 128 bytes long. 
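
The packing of a NIC tuple into one integer can be illustrated as follows. The field width of 4 bits is an assumption: it gives a 16-bit value per destination, which is consistent with 64 entries occupying 128 bytes, but the actual bit layout used by the macros is not stated here. The names are hypothetical.

```python
FIELD_BITS = 4                       # assumed width of one NIC field
FIELD_MASK = (1 << FIELD_BITS) - 1

def encode_tuple(dest_nics):
    """Pack destination NIC numbers (0 = unused) into one integer, with
    the field for the first source NIC in the lowest bits."""
    value = 0
    for i, nic in enumerate(dest_nics):
        value |= (nic & FIELD_MASK) << (i * FIELD_BITS)
    return value

def dest_nic(value, src_nic):
    """Extract the destination NIC for a 1-based source NIC number:
    a shift followed by a bitwise AND."""
    return (value >> ((src_nic - 1) * FIELD_BITS)) & FIELD_MASK

k00_to_k01 = encode_tuple((1, 3, 4, 0))      # the 1-3-4-0 tuple from the text
TABLE_BYTES = 64 * (4 * FIELD_BITS) // 8     # one 16-bit entry per PC
```

With this layout the 1-3-4-0 tuple becomes the single value 0x431, and looking up the destination for the second source NIC returns 3.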

Using standard Ethernet hardware, the routing by NIC numbers would re- 
quire the ARP cache in each machine to translate these addresses into MAC 
hardware addresses. This is easily done for small clusters, but can become less 
efficient if very large ARP caches are required. Thus, it may be more practical
to look up MAC addresses directly rather than NIC numbers. The result is a
modest increase in table size. For KLAT2, the MAC-lookup table would require 
1.5K bytes. 



3.4 Problems and Troubleshooting 

Unfortunately, the unusual properties of FNNs make it somewhat difficult to 
debug the system. Although one might expect wiring errors to be common, the 
color coding essentially eliminates this problem. Empirically, we have developed 
the following list of FNN problems and troubleshooting techniques: 

— The numbering of NICs depends on the PCI bus probe sequence, which might 
not be in an obvious order as the PCI bus slots are physically positioned on 
the motherboard. For example, the slots in the FIC SD11 motherboards are
probed in the physical order 1-2-4-3. Fortunately, the probe order is consis- 
tent for a particular motherboard, so it is simply a matter of determining 
this order using one machine before physically wiring the FNN. 

— If the FNN has an uplink switch, any unintended broadcast traffic, especially 
ARPs, can cripple network performance. Looking at the Ethernet status 
lights, it is very easy to recognize broadcasts; unfortunately, a switch failure 
also can result in unwanted broadcast traffic. Using a network analyzer and 
selectively pulling uplinks makes it fairly easy to identify the source(s) of 
the broadcasts. Typically, if it is a software problem, it will be an external 
machine that sent an ARP into the cluster. This problem can be fixed by 
appropriately adjusting ARP caches or by firewalling - which we strongly 
recommend for clusters. 

— Application-level software that assumes each machine has a single IP/MAC
address independent of the originating PC will cause many routes to go 
through the FNN uplink switch, whereas normal cluster-internal communi- 
cations do not use the uplink switch. All application code should use host 
name lookup (e.g., in the local ARP cache) on each node. 

Given that the system is functioning correctly with respect to the above 
problems, physical wiring problems (typically, a bad cable or NIC) are trivially 
detected by failure of a ping. 



4 Performance 

Although the asymmetry of FNNs defies closed-form analysis, it is possible to 
make a few analytic statements about performance. Using KLAT2, we also have 
preliminary empirical evidence that the benefits predicted for FNNs actually are 
delivered. 

4.1 Latency and Pairwise Bandwidth

Clearly, the minimum latency between any pair of PCs is just one switch delay,
and the minimum bandwidth available on any path is never less than that
provided by one NIC (i.e., 100Mb/s unidirectional, 200Mb/s bidirectional for
Fast Ethernet). The bandwidth available between a pair of PCs depends on the
precise wiring pattern; however, it is possible to compute a tight upper bound
on the average bandwidth as follows.

PCs communicate in pairs. Because no PC can have two NICs connected to 
the same switch, the number of ways in which a pair of connections through an
S-port switch can be selected is S*(S-1)/2. Only switch ports that are connected
to NICs count. Similarly, if there are P PCs, the number of pairs of PCs is
P*(P-1)/2. If we sum the number of connections possible through all switches
and divide that sum by the number of PC pairs, we have a tight upper bound 
on the average number of links between a PC pair. Since both the numerator 
and denominator of this fraction are divided by 2, the formula can be simplified 
by multiplying all terms by 2. 

For example, KLAT2’s network design described in this paper uses 4 NICs, 
31 ports on each of 8 switches and 8 ports on the ninth, and has 64 PCs (the 
two “hot spare” PCs are placed on the uplink switch). Thus, we get about 1.859 
bidirectional links/pair. In fact, the FNN design shown for KLAT2 achieves precisely
this average pairwise bandwidth. Using 100Mb/s Ethernet, that translates
to 371.8Mb/s bidirectional bandwidth per pair.
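
The bound can be checked numerically with a short sketch. The function name is hypothetical; the port counts and PC count are those given above for KLAT2.

```python
def avg_links_per_pair(used_ports, pcs):
    """Upper bound on the average number of links between a PC pair:
    the sum of S*(S-1) over all switches, divided by P*(P-1)."""
    return sum(s * (s - 1) for s in used_ports) / (pcs * (pcs - 1))

# KLAT2: 31 ports used on each of 8 switches, 8 ports on the ninth, 64 PCs.
klat2_links = avg_links_per_pair([31] * 8 + [8], 64)
klat2_mbps = klat2_links * 200      # 200Mb/s bidirectional per link
```

The numerator is 8*31*30 + 8*7 = 7496 and the denominator is 64*63 = 4032, giving about 1.859 links per pair and about 371.8Mb/s.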

An interesting side effect of this formula is that, if some switch ports will 
be unused, the maximum average pairwise bandwidth will be achieved when all 
but one of the switches have all their ports used. Thus, the GA naturally tends to
result in FNN designs that facilitate the folded uplink configuration. 



4.2 Bisection Bandwidth 

Bisection bandwidth is far more difficult to compute because the bisection is 
derived by dividing the machine in half in the worst way possible and measuring 
the maximum bandwidth between the halves. A reasonable upper bound on the 
bisection bandwidth is clearly the number of NICs per PC times the number of
PCs times the unidirectional bandwidth per NIC; for KLAT2, this is 4*64*100, 
or 25.6Gb/s. 

Generally, bisection bandwidth benchmarks measure performance using a 
permutation communication pattern, but which pairwise communications are 
used is not specified and it can make a large difference which PCs in each half 
are paired. If we select pairwise communications between the two halves using 
a random permutation, the expected bisection bandwidth can be computed using
the average bandwidth available per PC pair, computed as described above. For
KLAT2, this would yield 371.8Mb/s*32, or 11.9Gb/s. 

Of course, the above computations ignore the additional bandwidth available 
by hopping subnets using either routing through PCs or the uplink switch. Al- 
though a folded uplink switch adds slightly more bisection bandwidth than an 
unfolded one, it is easy to see that a non-folded uplink switch essentially adds
bisection bandwidth equal to the number of non-uplink switches used times the 
bidirectional uplink bandwidth. For KLAT2’s 9 switches with Fast Ethernet up- 
links, an unfolded uplink switch adds 1.8Gb/s to the 11.9Gb/s total, yielding 
13.7Gb/s. However, the routing techniques described in this paper normally ig- 
nore the communication paths that would route through the uplink switch. 
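
The bisection figures quoted in this section follow from simple arithmetic, reproduced in the sketch below. The variable names are hypothetical; every input value comes from the text.

```python
NIC_MBPS = 100                       # unidirectional Fast Ethernet rate
PCS, NICS_PER_PC = 64, 4
USED_PORTS = [31] * 8 + [8]          # ports in use on the 9 FNN switches

# Upper bound: every NIC of every PC sending at full rate.
upper_bound_gbps = NICS_PER_PC * PCS * NIC_MBPS / 1000

# Expected value for a random permutation across the bisection:
# 32 PC pairs, each with the average bidirectional pairwise bandwidth.
avg_links = sum(s * (s - 1) for s in USED_PORTS) / (PCS * (PCS - 1))
expected_gbps = avg_links * 2 * NIC_MBPS * (PCS // 2) / 1000

# An unfolded uplink switch adds one bidirectional uplink per FNN switch.
uplink_gbps = len(USED_PORTS) * 2 * NIC_MBPS / 1000
total_gbps = expected_gbps + uplink_gbps
```

These expressions give 25.6Gb/s for the upper bound, about 11.9Gb/s for the random-permutation estimate, and about 13.7Gb/s once the 1.8Gb/s of uplink bandwidth is included.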

Further complicating the measurement of FNN bisection bandwidth is the 
fact that peak bisection bandwidth for a FNN is not necessarily achievable for 
any permutation communication pattern. The ability of multiple NICs to function
simultaneously within each PC makes it far easier to achieve high bisection
bandwidth using a pattern in which many PCs simultaneously send messages
to various destinations through several of their NICs. We know of no standard
bisection bandwidth benchmark that would take advantage of this property, yet 
the performance increase is easily observed in running real codes. 



4.3 Empirical Performance 



The FNN concept is very new and we have not yet had time to fully evaluate its 
performance nor to clean-up and release the public-domain compiler and runtime 
software that we have been developing to support it. Thus, we have not yet run 
detailed network performance benchmarks. However, KLAT2’s FNN has enabled 
it to achieve very high performance on several applications. 

At this writing, a full CFD (Computational Fluid Dynamics) code [11], such
as normally would be run on a shared-memory machine, is running on KLAT2
well enough that it is a finalist for a Gordon Bell Price/Performance award.
KLAT2 also achieves over 64 GFLOPS on the standard LINPACK benchmark
(using ScaLAPACK with our 32-bit floating-point 3DNow! SWAR extensions).

Why is performance so good? The first reason is the bandwidth. As described 
above, KLAT2’s FNN has about 25Gb/s bisection bandwidth; an ideal 100Mb/s 
switch the full width of the cluster would provide no more than 6.4Gb/s bisection 
bandwidth, and such a switch would cost far more than the FNN. Although Gb/s 
hardware can provide higher pairwise bandwidth, using a tree switch fabric yields 
less than 10Gb/s bisection bandwidth at an order of magnitude higher cost than 
KLAT2’s FNN. 

Additional FNN performance boosts come from the low latency that results 
from having only a single switch delay between source and destination PCs and 
from the semi-independent use of multiple NICs. Having four NICs in each PC 
allows for parallel overlap in communications that the normal Linux IP mechanisms 
would not provide with channel bonding or with a single NIC. Further, 
because each hardware interface is buffered, the FNN communications benefit 
from greater buffered overlap. 
5 Conclusion 

In this paper, we have introduced a variety of compiler-flavored techniques for 
the design and use of a new type of scalable network, the Flat Neighborhood 
Network (FNN). 

The FNN topology and routing concepts make it exceptionally cheap to implement: 
the network hardware for KLAT2’s 64 (plus 2 spare) nodes cost about 
8,000 dollars. It is not only much cheaper than Gb/s alternatives that it outperforms, 
but also is cheaper than a conventional 100Mb/s implementation would 
have been using a single NIC per PC and a cluster-width switch. 

The low cost and high performance are not an accident, but are features 
designed using a genetic search algorithm (GA) to create a network optimized 
for the specific communications that are expected to be important for the parallel 
programs the system will run. Additional compiler tools also were developed to 
manage the relatively exotic wiring complexity and routing issues. With these 
tools, it is easy and cost-effective to customize the system network design at a 
level never before possible. 

KLAT2, the first FNN machine, is described in detail at: 
http://aggregate.org/KLAT2/ 
Abstract. Ownership sets are fundamental to the partitioning of program 
computations across processors by the owner-computes rule. These 
sets arise due to the mapping of data arrays onto processors. In this paper, 
we focus on how ownership sets can be efficiently determined in the 
context of the HPF language, and show how the structure of these sets 
can be symbolically characterized in the presence of arbitrary data alignment 
and data distribution directives. Our starting point is a system of 
equalities and inequalities due to Ancourt et al. that captures the array 
mapping problem in HPF. We arrive at a refined system that enables 
us to efficiently solve for the ownership set using the Fourier-Motzkin 
Elimination technique, and which requires the course vector as the only 
auxiliary vector. We develop important and general properties pertaining 
to HPF alignments and distributions, and show how they can be used 
to eliminate redundant communication due to array replication. We also 
show how the generation of communication code can be avoided when 
pairs of array references are ultimately mapped onto the same processors. 
Experimental data demonstrating the improved code performance 
that the latter optimization enables is presented and discussed. 



1 Introduction 

In an automated code generation scenario, the compiler decides the processors 
on which to execute the various compute operations occurring in a program. In 
languages such as High Performance Fortran (HPF), array mappings guide 
the computation partitioning process. They are specified by the programmer 
in terms of annotations called directives. The actual mapping process typically 
involves two steps: arrays are first aligned with a template and templates are 
then distributed onto virtual processor meshes. As a consequence of the align- 
ment operation — performed via the ALIGN directive — each array element gets 
assigned to at least a single template cell. Every template cell is then associated 
with exactly one processor through the DISTRIBUTE directive. In this way, array 
elements are eventually mapped onto processors. 

In allocating program computations to processors, the compiler uses the map- 
ping information associated with the data. A possible scheme known as the 
owner-computes rule is to allow only the owner of the left-hand side reference in 
an assignment statement to execute the statement. By the owner-computes rule, 
expressions that use elements located on the same processor can be evaluated 
locally on that processor, without the need for inter-processor communication. 
When the need to transfer remote data elements arises, the compiler produces 
the relevant communication code. Hence, the owner-computes rule leads to the 
notion of an ownership set which is the set of all data elements mapped onto a 
processor by virtue of the alignment and distribution directives in the program. 
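
As an illustration of the owner-computes rule, the sketch below partitions a hypothetical statement `X(i) = Y(i) + Y(i+1)` over a one-dimensional BLOCK distribution. The statement, the zero-based indexing and the function names are assumptions made for the example; they are not taken from the paper.

```python
def block_owner(index, n, nprocs):
    """Owner of element `index` (0-based) of an n-element array under BLOCK."""
    block = -(-n // nprocs)  # ceiling division gives the block size
    return index // block


def owner_computes(n, nprocs, me):
    """Iterations of the hypothetical statement X(i) = Y(i) + Y(i+1) that
    processor `me` executes, plus the remote Y elements it must fetch."""
    my_iters = [i for i in range(n - 1) if block_owner(i, n, nprocs) == me]
    remote = sorted({j for i in my_iters for j in (i, i + 1)
                     if block_owner(j, n, nprocs) != me})
    return my_iters, remote


print(owner_computes(8, 2, 0))
print(owner_computes(8, 2, 1))
```

Only the processor that owns X(i) runs iteration i; communication is needed exactly for the Y elements that fall outside its own block.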

Since the assignment of computations to processors is determined by the al- 
location of data to processors, one of the aims of the HPF compilation problem 
is to find a suitable way for capturing the alignment and distribution infor- 
mation. Given such a means, the next issue that must be addressed is how the 
owner-computes rule can be realized using the representation. Does the proposed 
framework provide insights into the nature of the array mapping problem? Does 
the representation reveal general properties that can be leveraged to generate 
efficient code? In this paper, we investigate these questions in the context of a 
recent representation proposed by Ancourt et al. 

1.1 Related Work 

The problem of array alignment and distribution has been extensively studied, 
and numerous structures that describe the mapping of arrays to processors 
have been suggested and examined. Early representations focused 
on BLOCK distributions alone and were incapable of conveniently describing the 
general CYCLIC(k) distribution. This deficiency was addressed in subsequent 
work by using techniques ranging from finite state machines and virtual processor 
meshes to set-theoretic methods. However, these schemes primarily 
concentrated on enumerating local memory access sequences and handling array 
expressions. A generalized view of the HPF mapping problem was subsequently 
presented by Ancourt et al., who showed how a system of equalities and 
inequalities could be used to mathematically express the regular alignment and 
distribution of arrays to processors. These systems were then used to formulate 
ownership sets and compute sets for loops qualified by the INDEPENDENT 
directive, and parametric solutions for the latter were provided based on the Hermite 
Normal Form. 

1.2 Contributions 

The system of equalities and inequalities in the Ancourt et al. framework uncovers 
interesting properties that relate to the HPF mapping problem. We discuss these 
properties and show how some of them can be exploited to devise an efficient 
run-time test that avoids redundant communication due to array replication. 
Our approach to solving for the ownership set is based on the Fourier-Motzkin 
Elimination (FME) technique, and in that respect, we deviate from Ancourt et al. We show 
how the originally proposed formulation for the ownership set can be refined to 
a form that involves the course vector as the only auxiliary vector and which 
also enables the efficient enumeration of the constituent elements of this set. We 
present a sufficient condition called the mapping test which eliminates the need 
for generating communication code for certain right-hand side array references 
in an assignment statement. The mapping test often results in marked performance 
improvements, and this fact is substantiated by experimental data. The 
techniques mentioned in this paper have been incorporated into a new version of 
the PARADIGM compiler, using Mathematica as the symbolic manipulation 
engine. 

2 Preliminaries 

Consider the synthesized declaration and HPF directives shown in Figure 1. Since 
the first dimension of A and the single subscript-triplet expression conform, this 
fragment is equivalent to that shown in Figure 2. The dummy variables i, j, k 
and l satisfy the constraints $-1 \le i \le 20$, $3 \le j \le 40$, $0 \le k \le 20$ and $0 \le l \le 99$ 
respectively. 



REAL A(-1:20, 3:40, 0:20) 
!HPF$ TEMPLATE T(0:99, 0:99, 0:99) 
!HPF$ PROCESSORS P(1:9, 1:9) 
!HPF$ ALIGN A(:, *, k) WITH T(2*k+1, 2:44:2, *) 
!HPF$ DISTRIBUTE T(*, CYCLIC(4), BLOCK(13)) ONTO P 

Fig. 1. Original Fragment 



REAL A(-1:20, 3:40, 0:20) 
!HPF$ TEMPLATE T(0:99, 0:99, 0:99) 
!HPF$ PROCESSORS P(1:9, 1:9) 
!HPF$ ALIGN A(i, j, k) WITH T(2*k+1, (i+1)*2+2, l) 
!HPF$ DISTRIBUTE T(*, CYCLIC(4), BLOCK(13)) ONTO P 

Fig. 2. Equivalent Fragment 



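The implications of the above alignment directives can be compactly expressed 
through the following collection of equalities and inequalities: 

$$\mathcal{R} t = A a + s_0 - \mathcal{R} l_T, \tag{1}$$

$$a_l \le a \le a_u, \tag{2}$$

$$0 \le t \le u_T - l_T. \tag{3}$$

For the given example, the various matrices and vectors are 

$$\mathcal{R} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}, \quad A = \begin{pmatrix} 0 & 0 & 2 \\ 2 & 0 & 0 \end{pmatrix}, \quad s_0 = \begin{pmatrix} 1 \\ 4 \end{pmatrix},$$
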


Mathematica is a registered trademark of Wolfram Research, Inc. 
and $a_l = (-1, 3, 0)^T$, $a_u = (20, 40, 20)^T$, $l_T = (0, 0, 0)^T$, $u_T = (99, 99, 99)^T$. 




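In a similar vein, information relating to the distribution directives can be 
captured by the following system: 

$$\pi t = C P c + C p + l, \tag{4}$$

$$\Lambda c = 0, \tag{5}$$

$$0 \le p < P\mathbf{1}, \tag{6}$$

$$0 \le l < C\mathbf{1}, \tag{7}$$

where $t$ satisfies (3) and $\mathbf{1}$ denotes the vector of ones. The respective matrices for the running example become 

$$\pi = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad C = \begin{pmatrix} 4 & 0 \\ 0 & 13 \end{pmatrix}, \quad P = \begin{pmatrix} 9 & 0 \\ 0 & 9 \end{pmatrix}, \quad \Lambda = \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix}.$$
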



While equation (1) is a consequence of the ALIGN directive, equations (4) 
and (5) are a result of the DISTRIBUTE directive. While (4) dictates the mapping 
of template cells onto processors, (5) indicates whether a particular processor 
dimension has a BLOCK or a CYCLIC distribution associated with it. Constraints 
on the array bounds vector $a$ and the template cell vector $t$ are given by (2) and 
(3) respectively. Finally, (6) and (7) describe the constraints that the processor 
identity vector $p$ and the offsets vector $l$ must satisfy. We shall use $x$, $y$ and $z$ to 
represent the number of dimensions of the alignee, template and processor 
mesh respectively. Using this notation, the column vectors $a$ and $t$ consist of $x$ 
and $y$ elements respectively, while the column vectors $p$, $c$ and $l$ have $z$ elements 
each. 
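
To make the two-step mapping concrete, the following sketch evaluates equations (1) and (4) to (7) by hand for the running example. It assumes zero-based processor coordinates, as in $0 \le p < P\mathbf{1}$; the function name `owners` is hypothetical and the closed forms are worked out only for this example.

```python
def owners(i, j, k):
    """Zero-based processor coordinates (p1, p2) that hold A(i, j, k) in the
    running example. The index j is unused because the second dimension of A
    is collapsed by the ALIGN directive."""
    t1, t2 = 2 * k + 1, 2 * i + 4  # aligned template cells, from equation (1)
    assert 0 <= t1 <= 99 and 0 <= t2 <= 99
    # Template dimension 2 is CYCLIC(4) over 9 processors:
    # t2 = 36*c1 + 4*p1 + l1 with 0 <= l1 < 4, so p1 = (t2 // 4) mod 9.
    p1 = (t2 // 4) % 9
    # Template dimension 3 is replicated and BLOCK(13): every cell 0..99 holds
    # a copy, and c2 = 0 by equation (5), so p2 = t3 // 13.
    p2s = sorted({t3 // 13 for t3 in range(100)})
    return [(p1, p2) for p2 in p2s]


print(owners(-1, 3, 0))
print(owners(20, 40, 20)[0])
```

In this example no template cell reaches the processors with second coordinate 8 (13 x 8 = 104 exceeds the template extent of 100), so those processors own nothing.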

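We can now formulate the ownership set of a processor $p$, which is defined 
with respect to an array $X$. It denotes those elements of $X$ that are finally mapped 
onto the processor $p$. In set-theoretic notation, this set becomes 

$$
\begin{aligned}
\Delta_p(X) = \{\, a \mid \exists\, t, c, l \text{ such that } \quad
& \mathcal{R} t = A a + s_0 - \mathcal{R} l_T \\
\wedge\; & \pi t = C P c + C p + l \\
\wedge\; & \Lambda c = 0 \\
\wedge\; & a_l \le a \le a_u \\
\wedge\; & 0 \le t \le u_T - l_T \\
\wedge\; & 0 \le l < C\mathbf{1} \,\},
\end{aligned} \tag{8}
$$

where $0 \le p < P\mathbf{1}$. 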



3 Ownership Sets Revisited 

The new version of the PARADIGM compiler uses the system of equalities 
and inequalities described in §2 to solve the HPF mapping problem. The 
PARADIGM approach to solving for the ownership, compute and communication 
sets is based on integer FME solutions. Though the FME technique has 
a worst-case exponential time complexity [12], we have observed the method to 
be reasonably efficient in practice. 
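
For readers unfamiliar with FME, here is a minimal sketch of a single elimination step over the rationals. It is not the integer FME solver used by PARADIGM; the row representation and the name `fme_eliminate` are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product


def fme_eliminate(rows, var):
    """Eliminate variable `var` from inequalities sum(coefs[i] * x[i]) <= rhs.

    Each row is a pair (coefs, rhs). Rows with a positive coefficient on `var`
    are paired with rows with a negative one so that `var` cancels.
    """
    pos, neg, rest = [], [], []
    for coefs, rhs in rows:
        c = coefs[var]
        (pos if c > 0 else neg if c < 0 else rest).append((coefs, rhs))
    for (pc, pr), (nc, nr) in product(pos, neg):
        a, b = Fraction(pc[var]), Fraction(-nc[var])  # both positive
        coefs = [b * u + a * v for u, v in zip(pc, nc)]
        rest.append((coefs, b * pr + a * nr))
    return rest


# Variables (x, y) with 0 <= x, x <= y, y <= 5; eliminating x bounds y.
system = [([-1, 0], 0), ([1, -1], 0), ([0, 1], 5)]
print(fme_eliminate(system, 0))
```

Each step can multiply the number of inequalities (every positive row is paired with every negative row), which is the source of the worst-case exponential behaviour mentioned above.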

The elements of $c$, $l$ and $t$ manifest as auxiliary loop variables when compute 
sets derived from ownership sets are used to partition loops. It is therefore 
desirable to have fewer unknowns from the standpoint of code 
generation. Reducing the number of unknowns also improves the overall timing 
of the FME solver. To meet these goals, the formulation for the ownership set 
given in (8) was refined using two transformations: 
$t = \mathcal{R}^T (A a + s_0 - \mathcal{R} l_T) + (I - \mathcal{R}^T \mathcal{R}) \pi^T (C P c + C p)$ 
and $c = \pi \mathcal{R}^T \mathcal{R} \pi^T c$. 



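Lemma 3.1: The following is an equivalent definition for the ownership set: 

$$
\begin{aligned}
\Delta_p(X) = \{\, a \mid \exists\, c \text{ such that } \quad
& \pi \mathcal{R}^T \mathcal{R} \pi^T (C P c + C p) \le \pi \mathcal{R}^T (A a + s_0 - \mathcal{R} l_T) \\
\wedge\; & \pi \mathcal{R}^T (A a + s_0 - \mathcal{R} l_T) \le C P c + C p + C\mathbf{1} - \mathbf{1} \\
\wedge\; & 0 \le C P c + C p \le \pi (u_T - l_T) \\
\wedge\; & (I - \pi \mathcal{R}^T \mathcal{R} \pi^T) c = 0 \\
\wedge\; & \Lambda c = 0 \\
\wedge\; & a_l \le a \le a_u \\
\wedge\; & 0 \le A a + s_0 - \mathcal{R} l_T \le \mathcal{R} (u_T - l_T) \,\},
\end{aligned} \tag{9}
$$

where, for every $a \in \Delta_p(X)$, there exists exactly one $c$ for which the system in (9) 
holds. 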



An important consequence of Lemma 3.1 is that if the FME technique is now 
applied to the system in (9), then, whatever be the order of elimination of the 
unknown variables (corresponding to the elements of $c$ and $a$), the associated 
loop nest scans every member of the ownership set (i.e., $a$) exactly once. To see 
why this is so, let $e$ represent one such elimination order. Suppose $\theta = \|e\| = x + z$, 
where $x$ and $z$ denote the number of dimensions of the alignee and processor mesh 
respectively. The integer bound expressions returned by the FME solver can be 
used to construct a loop nest that scans $\Delta_p(X)$. The outermost loop in this nest 
matches $e_1$, while the innermost loop matches $e_\theta$. Consider an iteration point 
$\rho$ of such a loop nest and let $\sigma$ be any other iteration point of the same loop 
nest. Thus, $\rho$ and $\sigma$ also represent solutions to the system in (9). Since every 
iteration point of a loop nest is distinct, let $\rho$ and $\sigma$ differ in the $i$th position. 
If $e_i = a_n$ for some element $a_n$ of $a$, then the $a$ that corresponds to $\rho$ obviously differs from the $a$ that 
corresponds to $\sigma$. If instead $e_i = c_j$, then the $c$ that corresponds to $\rho$ differs from 
the $c$ that corresponds to $\sigma$. But from Lemma 3.1, two different course vectors 
cannot satisfy the system for the same $a$. Thus, the corresponding values for $a$ 
in $\rho$ and $\sigma$ must also be different. That is, the $a$ associated with the iteration 
point $\rho$ must be different from the $a$ associated with any other iteration point $\sigma$ 
of the same loop nest. In other words, every member of the ownership set gets 
enumerated exactly once. 
Note that if the FME technique were applied to the system in (8) and if a 
loop nest were generated from the resulting solution system to scan $\Delta_p(X)$, it is 
not guaranteed that every member of $\Delta_p(X)$ will get enumerated exactly once. 
This inability to ensure the “uniqueness” of each enumerated member has serious 
repercussions if the ownership set is used to generate other sets. For instance, we 
use the compute sets defined by Ancourt et al. to handle loops qualified by the INDEPENDENT 
directive. There, these sets are defined using a formulation of the ownership set 
similar to that in (8). If the FME technique were applied directly to that compute 
set formulation, certain iterations in this set could get executed more 
than once for certain alignment and distribution combinations. However, if the 
formulation in (9) is used, this problem can be avoided. The FME approach that 
we adopt to solve for these sets is quite different from the approach of Ancourt et al., where 
a parametric solution based on the Hermite Normal Form was exploited for the 
same purpose. 

Though the system in (9) has the same number of inequalities as the formulation 
in (8), the number of unknowns to be eliminated is smaller by $y + z$ variables. 
This resulted in improved solution times for the FME solver. For instance, when 
$\Delta_p(A)$ was computed for the array A in the example in §2, a significant reduction 
in the time required to solve the system was noticed; while solving the system 
in (8) took 0.58 seconds, solving the system in (9) took only 0.25 seconds. 

4 An Equivalence Relation 

It is interesting to enquire into the nature of ownership sets across processors. 
That is, for an arbitrary alignment and distribution combination, can these sets 
partially overlap? Or, are they equal or disjoint? The answers to these questions 
can be used to devise an efficient run-time test that avoids redundant communication 
due to array replication (see §5). 



Lemma 4.1: If $\pi \mathcal{R}^T \mathcal{R} \pi^T (p - q) = 0$, and $\Delta_p(X) \ne \emptyset$, $\Delta_q(X) \ne \emptyset$, then $\Delta_p(X) = \Delta_q(X)$. 



To comprehend the meaning of Lemma 4.1, an understanding of the expression 
$\pi \mathcal{R}^T \mathcal{R} \pi^T$ is necessary. The product $\pi \mathcal{R}^T \mathcal{R} \pi^T$ is a square diagonal matrix 
of size $z \times z$. It is easy to see that the principal diagonal elements of this matrix 
are either 0 or 1. It is also easy to see that the $j$th principal diagonal element is 0 
if and only if the template dimension distributed on the $j$th processor dimension 
is a replicated dimension. What is a “replicated dimension”? We refer to those 
dimensions of a template that contain a * in the alignment specification as 
its replicated dimensions. (Specifically, replicated dimensions are those dimensions 
of a template that contain either a * or an unmatched dummy variable in the 
alignment specification.) The remaining non-* dimensions are called its aligned 
dimensions. The definitions of an aligned and replicated dimension arise on account 
of a particular ALIGN directive and are always spoken of in connection with 
that directive. For example, in Figure 1, the third dimension of T is a replicated 
dimension while its first and second dimensions are the aligned dimensions in 
that ALIGN directive. 

Suppose that the array X is aligned to a template T that is then distributed 
onto the processor mesh P. Lemma 4.1 states that if the coordinates p and q of 
two processors in P match in at least those dimensions onto which the aligned 
dimensions of T are distributed, and if p and q own at least one element of the 
alignee array X, then their ownership sets with respect to X must be the same. 



Lemma 4.2: If $\Delta_p(X) \cap \Delta_q(X) \ne \emptyset$, then $\pi \mathcal{R}^T \mathcal{R} \pi^T (p - q) = 0$. 



The reverse is also true; if the ownership sets with respect to X of two processors 
in P overlap, then their coordinates along those dimensions of P onto which 
the aligned dimensions of T are distributed must match. This is what Lemma 4.2 
states. The above two lemmas can be used to prove the following theorem. 



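Theorem 1 

For a given array $X$, the ownership sets across processors must be either equal or 
disjoint. That is, 

$$\Delta_p(X) = \Delta_q(X) \quad \text{or} \quad \Delta_p(X) \cap \Delta_q(X) = \emptyset.$$

Proof: 

Suppose $\Delta_p(X)$ and $\Delta_q(X)$ are not disjoint. Then $\Delta_p(X) \cap \Delta_q(X) \ne \emptyset$. Hence, from 
Lemma 4.2, we get 

$$\pi \mathcal{R}^T \mathcal{R} \pi^T (p - q) = 0. \tag{1.1}$$

Since we have assumed that $\Delta_p(X) \cap \Delta_q(X) \ne \emptyset$, then $\Delta_p(X) \ne \emptyset$ and $\Delta_q(X) \ne \emptyset$. 
By Lemma 4.1, this fact and (1.1) therefore imply 

$$\Delta_p(X) = \Delta_q(X).$$

Thus, either $\Delta_p(X) \cap \Delta_q(X) = \emptyset$ or $\Delta_p(X) = \Delta_q(X)$ must be true. 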
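
Theorem 1 can be checked by brute force on the running example of §2. The sketch below enumerates every ownership set directly from the directives rather than through the FME machinery; zero-based processor coordinates and all function names are assumptions of the example.

```python
from itertools import combinations, product


def ownership_sets():
    """Brute-force ownership sets of A for every zero-based p in the 9 x 9 mesh."""
    sets = {p: set() for p in product(range(9), repeat=2)}
    p2s = {t3 // 13 for t3 in range(100)}      # replicated dimension, BLOCK(13)
    for i, j, k in product(range(-1, 21), range(3, 41), range(0, 21)):
        p1 = ((2 * i + 4) // 4) % 9            # CYCLIC(4) on template dimension 2
        for p2 in p2s:
            sets[(p1, p2)].add((i, j, k))
    return sets


def equal_or_disjoint(sets):
    """True when any two ownership sets are either equal or disjoint,
    i.e. when the distinct sets are pairwise disjoint."""
    distinct = {frozenset(s) for s in sets.values()}
    return all(a.isdisjoint(b) for a, b in combinations(distinct, 2))


sets = ownership_sets()
print(equal_or_disjoint(sets))
```

Processors that differ only in the coordinate fed by the replicated template dimension hold identical sets, which is the situation that the replication test of §5 exploits.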



Let us define a binary relation $\sim$ on a mapped array such that given two 
array elements $\beta$ and $\gamma$, $\beta \sim \gamma$ if and only if $\beta$ and $\gamma$ are mapped onto the same 
processors. The rules of HPF ensure that for any legal ALIGN and DISTRIBUTE 
combination, every element of the mapped array will reside on at least one 
processor $p$. Hence, $\sim$ must be reflexive. Also, if $\beta \sim \gamma$, $\gamma \sim \beta$ is obviously 
true. Therefore, $\sim$ is symmetric. Finally, if $\beta \sim \gamma$ and $\gamma \sim \delta$ are true, then from 
Theorem 1, $\beta \sim \delta$. That is, $\sim$ is transitive. Hence, the ALIGN and DISTRIBUTE 
directives for a mapped array define an equivalence relation on that array. 
5 The Replication Test 

Let $\nabla_p(S', Y(S_y \alpha + a_{0_y}))$ indicate those elements of a right-hand side array 
reference $Y(S_y \alpha + a_{0_y})$ contained in an assignment statement $S'$ that $p$ views 
on account of its compute work in $S'$. Thus $p$ would have to potentially fetch these 
elements from a remote processor $q$, and the set of elements to be received would 
be $\Delta_q(Y) \cap \nabla_p(S', Y(S_y \alpha + a_{0_y}))$. Likewise, $p$ would have to potentially send 
the elements in $\Delta_p(Y) \cap \nabla_q(S', Y(S_y \alpha + a_{0_y}))$ to a remote processor $q$ because $q$ 
may in turn view the elements owned by $p$. Code fragments in Figure 3 illustrate 
how such data exchange operations can be realized. 



for each $0 \le q < P\mathbf{1}$ 
    if $p \ne q$ then 
        send $\Delta_p(Y) \cap \nabla_q(S', Y(S_y \alpha + a_{0_y}))$ to $q$ 
    endif 
endfor 

for each $0 \le q < P\mathbf{1}$ 
    if $p \ne q$ then 
        receive $\Delta_q(Y) \cap \nabla_p(S', Y(S_y \alpha + a_{0_y}))$ from $q$ 
    endif 
endfor 



Fig. 3. Send and Receive Actions at Processor p 



In §4 we saw that if $\Delta_p(Y) \ne \emptyset$ and $\Delta_q(Y) \ne \emptyset$, then $\Delta_p(Y)$ and $\Delta_q(Y)$ 
are equal if and only if $\pi \mathcal{R}^T \mathcal{R} \pi^T (p - q) = 0$. This property can be used to avoid 
redundant communication due to array replication; the modified code fragments 
in Figure 4 show this optimization. 



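for each $0 \le q < P\mathbf{1}$ 
    if $(\Delta_q(Y) = \emptyset) \vee (\pi \mathcal{R}^T \mathcal{R} \pi^T (q - p) \ne 0)$ then 
        send $\Delta_p(Y) \cap \nabla_q(S', Y(S_y \alpha + a_{0_y}))$ to $q$ 
    endif 
endfor 

for each $0 \le q < P\mathbf{1}$ 
    if $(\Delta_p(Y) = \emptyset) \vee (\pi \mathcal{R}^T \mathcal{R} \pi^T (p - q) \ne 0)$ then 
        receive $\Delta_q(Y) \cap \nabla_p(S', Y(S_y \alpha + a_{0_y}))$ from $q$ 
    endif 
endfor 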



Fig. 4. Send and Receive Actions with the Replication Test Optimization 
Once the integer FME solution for the system of equalities and inequalities 
that describes the ownership set is obtained, computing whether $\Delta_p(Y)$ is the 
empty set for a given $p$ only incurs an additional polynomial-time overhead. 
The key idea that enables this decision is that in the FME solution system for 
(9), a particular $p_j$ will occur in the bound expressions for $c_j$ ($p_j$ and $c_j$ are 
elements in $p$ and $c$ respectively). That is, there will be an inequality pair of the 
form 

$$f_j(p_j) \le c_j \le g_j(p_j) \tag{10}$$

in the solution system. In addition, there could be one more inequality pair in the 
FME solution system that contains $p_j$ in its bound expressions. This inequality 
pair will have the form $F_j(p_j, c_j) \le a_{n_j} \le G_j(p_j, c_j)$, where $a_{n_j}$ is an element of $a$. Besides the above two 
inequality pairs, there can be no other inequality pair in the FME solution 
system that also contains $p_j$. Hence, if (9) has a solution $a$ for a given $p$, each 
of the $z$ disjoint inequality groups 

$$f_j(p_j) \le c_j \le g_j(p_j), \qquad F_j(p_j, c_j) \le a_{n_j} \le G_j(p_j, c_j)$$

must independently admit a solution. The task of checking whether each of these 
groups has a solution for a particular $p_j$ is clearly of quadratic complexity. Hence, 
the complexity of ascertaining whether $\Delta_p(Y)$ is non-empty for a given $p$ is 
polynomial. Since evaluating the condition $\pi \mathcal{R}^T \mathcal{R} \pi^T (p - q) \ne 0$ is of $O(z)$ time 
complexity (as is the test $p \ne q$), the overall run-time complexity to evaluate 
the Boolean predicate, given the integer FME solution system for the ownership 
set (which is known at compile-time), becomes polynomial. 

Observe that in the absence of replication, $\mathcal{R}^T \mathcal{R}$ is the identity matrix; in 
this situation, $\pi \mathcal{R}^T \mathcal{R} \pi^T (p - q) \ne 0$ if and only if $p \ne q$. Hence, in the absence 
of replication, the test degenerates to the usual $p \ne q$ condition; we therefore 
refer to the Boolean predicate $\pi \mathcal{R}^T \mathcal{R} \pi^T (p - q) \ne 0$ as the replication test. 
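
Because the product matrix in the replication test is diagonal with 0/1 entries, the test reduces to a masked comparison of processor coordinates. The sketch below stores only that diagonal as `mask`; the function names and the Boolean flag are assumptions for illustration.

```python
def replication_test(mask, p, q):
    """True when the masked difference of p and q is non-zero. `mask` is the
    0/1 diagonal of the product matrix; a 0 marks a processor dimension that
    is fed by a replicated template dimension."""
    return any(m * (pi - qi) != 0 for m, pi, qi in zip(mask, p, q))


def must_send(mask, p, q, q_owns_nothing):
    """Send guard in the style of Figure 4: q holds no copy, or q's ownership
    set differs from p's."""
    return q_owns_nothing or replication_test(mask, p, q)


mask = (1, 0)  # running example: second processor dimension is replicated
print(replication_test(mask, (2, 3), (2, 5)))   # same ownership set
print(replication_test(mask, (2, 3), (4, 3)))   # different ownership sets
```

With a mask of all ones (no replication) the test is the same as `p != q`, matching the degenerate case described above.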

6 The Mapping Test 

Consider the assignment statement $S'$ contained in a loop nest characterized by 
the loop iteration vector $\alpha$, and in which the subscript expressions are linear: 

$$X(S_x \alpha + a_{0_x}) = \cdots + Y(S_y \alpha + a_{0_y}) + \cdots$$

Consider $\Delta_p(Y) \cap \nabla_q(S', Y(S_y \alpha + a_{0_y}))$ and $\Delta_q(Y) \cap \nabla_p(S', Y(S_y \alpha + a_{0_y}))$ 
shown in Figures 3 and 4. These communication sets would be generated 
for the above assignment statement and take into account the relative 
alignments and distributions of the left-hand side and right-hand side array references. 
In the event of these communication sets being null, no communication 
occurs at run-time. However, the overhead of checking at run-time whether a particular 
processor must necessarily dispatch a section of its array to some other 
processor that views it exists, irrespective of whether data is actually communicated 
or not. This could result in the expensive run-time cost of communication 
checks, which could have been avoided altogether if the compiler had the capacity 
to ascertain that the elements of the array references $X(S_x \alpha + a_{0_x})$ and 
$Y(S_y \alpha + a_{0_y})$ are in fact ultimately mapped onto the same processor. This is 
precisely what the mapping test attempts to detect. 

The mapping test (elaborated in Lemma 6.1) is a sufficient condition which 
when true guarantees that the processor owning the left-hand side array reference 
of an assignment statement also identically owns a particular right-hand 
side array reference of the same statement. By “identically owns,” we mean 
that for any value of the loop iteration vector $\alpha$ (not necessarily those limited 
by the bounds of the enclosing loop nest), the right-hand side array element 
$Y(S_y \alpha + a_{0_y})$ resides on the same processor that owns the left-hand side array 
element $X(S_x \alpha + a_{0_x})$. 
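
The property that the mapping test establishes symbolically can be illustrated by brute force. The sketch below is not Lemma 6.1; it simply enumerates owners for a hypothetical statement of the form `AU1(i, j, k) = ... DU1(j) ...`, using the ADI alignments of Figure 5 and a (BLOCK, BLOCK, *) distribution on a 1 x 4 mesh. The statement and all function names are assumptions.

```python
from itertools import product


def block_owner(t, extent, nprocs):
    """Zero-based owner of 1-based template cell t under BLOCK."""
    block = -(-extent // nprocs)  # ceiling division gives the block size
    return (t - 1) // block


def owners_au(i, j, k):
    """AU1(i, j, k) is aligned with T(i, j, k); T(4, 1024, 2) is distributed
    (BLOCK, BLOCK, *) onto a 1 x 4 mesh."""
    return {(block_owner(i, 4, 1), block_owner(j, 1024, 4))}


def owners_du(i):
    """DU1(i) is aligned with T(*, i, *), so it is replicated along the
    first template dimension."""
    return {(block_owner(t1, 4, 1), block_owner(i, 1024, 4))
            for t1 in range(1, 5)}


def identically_owned(lhs, rhs, points):
    """True when every owner of the left-hand side also holds the
    right-hand side element, for every iteration point."""
    return all(lhs(*pt) <= rhs(*pt) for pt in points)


pts = list(product(range(1, 5), range(1, 1024), range(1, 3)))
print(identically_owned(owners_au, lambda i, j, k: owners_du(j), pts))
print(identically_owned(owners_au, lambda i, j, k: owners_du(j + 1), pts))
```

The unshifted reference is always local, so its communication checks could be dropped; the shifted reference `DU1(j+1)` crosses a block boundary at j = 256 and therefore fails the check.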



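Lemma 6.1: Define two vectors $\zeta$ and $\xi$ for the assignment statement $S'$: 

$$\zeta = \pi_T \mathcal{R}_x^T (A_x (S_x \alpha + a_{0_x}) + s_{0_x} - \mathcal{R}_x l_T),$$

$$\xi = \pi_S \mathcal{R}_y^T (A_y (S_y \alpha + a_{0_y}) + s_{0_y} - \mathcal{R}_y l_S).$$

Here we assume that T and S are the templates to which X and Y are respectively 
aligned, and that both T and S are distributed onto a processor mesh P. If $\mu = 
\pi_T \mathcal{R}_x^T \mathcal{R}_x \pi_T^T$ and $\nu = \pi_S \mathcal{R}_y^T \mathcal{R}_y \pi_S^T$, then 

$$
\begin{aligned}
& \Delta_p(Y) \ne \emptyset \quad \forall\, 0 \le p < P\mathbf{1} \\
\wedge\; & \nu \le \mu \\
\wedge\; & \nu (\lfloor C_T^{-1} \zeta \rfloor \bmod P\mathbf{1}) = \lfloor C_S^{-1} \xi \rfloor \bmod P\mathbf{1}
\end{aligned} \tag{11}
$$

is a sufficient condition for the array reference $Y(S_y \alpha + a_{0_y})$ to be identically 
mapped onto the same processor which owns $X(S_x \alpha + a_{0_x})$. 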



The sufficient condition stated in Lemma 6.1 can be established at compile-time. 
The predicate (11) forms the actual mapping test. Verifying whether $\nu \le \mu$ 
and $\nu (\lfloor C_T^{-1} \zeta \rfloor \bmod P\mathbf{1}) = \lfloor C_S^{-1} \xi \rfloor \bmod P\mathbf{1}$ is an $O(z)$ time complexity operation. 
Establishing whether $\Delta_p(Y) \ne \emptyset\ \forall\, 0 \le p < P\mathbf{1}$ is a polynomial time operation 
given the symbolic representation of the ownership set (see §5). 
Thus, the overall time complexity for verifying the requirements of Lemma 6.1 
is polynomial, once the FME solution system for the ownership set is known. 

The impact of the mapping test on run times can often be dramatic. To 
illustrate the savings, the run times for the ADI benchmark, for arrays of sizes 
4 × 1024 × 2 on a 1 × 4 mesh of processors, with and without this optimization were 
0.51 and 64.79 seconds respectively! The large value of 64.79 seconds arose due 
to three assignment statements which were the sinks of loop-independent flow 
dependencies and which were enclosed within a triply nested loop spanning an 
iteration space of 2048 × 2 × 1022 points. Each of the three assignment statements 
included right-hand side array references that were finally distributed onto the 
same processor as the corresponding left-hand side array reference. Hence, in 




Exploiting Ownership Sets in HPF 269 



all, eighteen communication checks (nine for MPI_SEND and another nine for 
MPI_RECV) per iteration were eliminated. 
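
The scale of the saving follows from simple arithmetic on the figures quoted above. In the sketch below the constant and function names are illustrative; the numbers are those given in the text.

```python
ITERATION_SPACE = (2048, 2, 1022)   # triply nested loop quoted in the text
CHECKS_PER_ITERATION = 18           # nine for MPI_SEND plus nine for MPI_RECV


def eliminated_checks(space, per_iteration):
    """Total communication checks removed over the whole iteration space."""
    points = 1
    for extent in space:
        points *= extent
    return points * per_iteration


total = eliminated_checks(ITERATION_SPACE, CHECKS_PER_ITERATION)
speedup = 64.79 / 0.51              # run time without / with the mapping test
print(total, round(speedup, 1))
```

Roughly 75 million checks disappear, which is consistent with a run-time ratio on the order of 127.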

7 Mapping Test Measurements 

Execution times and compilation times were measured for the PARADIGM (ver- 
sion 2.0) system with and without the mapping test optimization. For the sake 
of comparison, execution times and compilation times for the original sequential 
sources and the parallelized codes generated by pghpf (version 2.4) and 
xlhpf (version 1.03) were also recorded. pghpf and xlhpf are commercial HPF 
compilers from the Portland Group, Inc. (PGI) and International Business 
Machines (IBM) respectively. In the input codes to the pghpf compiler, 
DO loops were recast into FORALL equivalents where possible and were qualified 
with the INDEPENDENT directive when appropriate. The FORALL construct and the 
INDEPENDENT directive were not mixed in the inputs to pghpf and the tabulated 
execution times correspond to the best of the two cases. All of the PARADIGM 
measurements were done in the presence of the replication test. 



7.1 System Specifications 

The IBM compilers xlf and mpxlf were used to handle Fortran 77 and Fortran 77+MPI 
sources respectively. The HPF sources were compiled using xlhpf 
and pghpf. The -O option, which results in the generation of optimized code, 
was always used during compilations done with xlf, xlhpf, mpxlf and pghpf. 
Compilation times were obtained by considering the source-to-source transformation 
effected by PARADIGM, as well as the source-to-executable compilation 
done using mpxlf (version 5.01). The source-to-source compilation times 
for PARADIGM were measured on an HP Visualize C180 with a 180MHz HP 
PA-8000 CPU, running HP-UX 10.20 and having 128MB of RAM. Compilation 
times for pghpf, xlhpf as well as mpxlf were measured on an IBM E30 running 
AIX 4.3 and having a 133MHz PowerPC 604 processor and 96MB of main memory. 
In those tables that tabulate the execution times, the RS6000 column refers 
to the sequential execution times obtained on the IBM E30. The parallel codes 
were executed on a 16-node IBM SP2 multicomputer, running AIX 4.3 and in 
which each processor was a 62.5MHz POWER node having 128MB of RAM. 
Inter-processor communication on the IBM SP2 was across a high performance 
adapter switch. 

7.2 Alignments and Distributions 

Measurements for the mapping test were taken across three benchmarks: ADI, 
Euler Fluxes (from FLO52 in the Perfect Club Suite) and Matrix Multiplication. 
For all input samples, fixed templates and alignments were chosen; these 
are shown in Figure 5. Note that the most suitable alignments were chosen for 
the benchmark input samples. For the Matrix Multiplication and ADI benchmarks, 
these alignments resulted in communication-free programs irrespective 
of the distributions. For the Euler Fluxes benchmark, the distributions resulted 
in varying amounts of communication. 

!HPF$ TEMPLATE T(4, 1024, 2) 
!HPF$ ALIGN DU1(i) WITH T(*, i, *) 
!HPF$ ALIGN DU2(i) WITH T(*, i, *) 
!HPF$ ALIGN DU3(i) WITH T(*, i, *) 
!HPF$ ALIGN AU1(i, j, k) WITH T(i, j, k) 
!HPF$ ALIGN AU2(i, j, k) WITH T(i, j, k) 
!HPF$ ALIGN AU3(i, j, k) WITH T(i, j, k) 

!HPF$ TEMPLATE T(0:5001, 34, 4) 
!HPF$ ALIGN FS(i, j, k) WITH T(i, j, k) 
!HPF$ ALIGN DW(i, j, k) WITH T(i, j, k) 
!HPF$ ALIGN W(i, j, k) WITH T(i, j, k) 
!HPF$ ALIGN X(i, j, k) WITH T(i, j, k) 
!HPF$ ALIGN P(i, j) WITH T(i, j, *) 

!HPF$ TEMPLATE S(1024, 1024) 
!HPF$ TEMPLATE T(1024, 1024) 
!HPF$ ALIGN A(i, *) WITH S(i, *) 
!HPF$ ALIGN B(*, j) WITH S(*, j) 
!HPF$ ALIGN C(i, j) WITH T(i, j) 

Fig. 5. ADI, Euler Fluxes and Matrix Multiplication 

The PROCESSORS and the DISTRIBUTE directives were changed in every bench- 
mark’s input sample. The various distributions were chosen arbitrarily, the idea 
being to demonstrate the ability of the mapping test to handle any given align- 
ment and distribution combination. 



Table 1. Execution Times in Seconds (IBM AIX 4.3) 

Benchmark             Processor  Distribution               RS6000        SP2            SP2     SP2     SP2
                      Array Size                               xlf  pdm v 2.0      pdm v 2.0   pghpf   xlhpf
                                                            v 5.01  Optimized  Non-optimized   v 2.4  v 1.03
ADI                   1x2        (BLOCK, BLOCK, *)            7.07       1.41          83.57   16.78    3.36
                      1x4        (BLOCK, BLOCK, *)            6.79       0.51          64.79   12.67    2.35
                      1x8        (BLOCK, CYCLIC(120), *)      7.90       4.34         609.44   20.94
Euler Fluxes          1x2        (BLOCK, BLOCK, *)           71.85      19.03          19.44  231.67   33.79
                      2x2        (BLOCK, *, CYCLIC)          71.40      13.47          13.83  274.93 3113.51
                      8x1        (*, BLOCK, CYCLIC)          71.49       5.96           6.39   91.83    8.17
Matrix Multiplication 2x1        (BLOCK, BLOCK)*             12.65       2.47           5.83   57.48   25.88
                      2x2        (BLOCK, BLOCK)*            104.96      10.06          17.20  224.29  100.37
                      4x2        (BLOCK, CYCLIC(120))*      104.74       5.71         205.00  123.59

7.3 Analysis 

As Table 1 reveals, the benefits of the mapping test were most pronounced for
the ADI benchmark, followed by the Matrix Multiplication benchmark. In the
case of the Euler Fluxes benchmark, the mapping test eliminated six
communication checks per iteration for the first input sample, and eight
communication checks per iteration for the second and third input samples. In
the absence of the mapping test, ten communication checks per iteration were
generated. On account of loop-carried flow dependencies, the associated
communication code was hoisted immediately within the outermost loop. However,
since the outermost loop ran for a mere 100 iterations, the optimized compiled
codes did not exhibit any significant improvement in run times. For all of the
ADI benchmark input samples, the iteration space comprised 2048 x 2 x 1022
points, and the communication codes generated in the absence of the mapping
test were hoisted immediately within the innermost loop. For the three Matrix
Multiplication benchmark samples, the numbers of iteration points were
512 x 512 x 512, 1024 x 1024 x 1024, and 1024 x 1024 x 1024, respectively, and
the single communication check that was generated in the absence of the mapping
test was hoisted within the second innermost loop.
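
The hoisting depths quoted in this analysis determine how often a communication
check runs. The following is a minimal sketch of that arithmetic, assuming one
execution of each check per visit to the loop body at the hoisting depth. The
function name and its arguments are illustrative and are not part of PARADIGM;
only the loop extents and check counts come from the text.

```python
from math import prod


def dynamic_checks(loop_extents, hoist_depth, checks_per_visit=1):
    """Count executions of checks hoisted immediately inside the loop at
    `hoist_depth` (1 = outermost) of a nest with the given extents."""
    return checks_per_visit * prod(loop_extents[:hoist_depth])


# ADI: checks sit inside the innermost of three loops.
adi = dynamic_checks((2048, 2, 1022), hoist_depth=3)

# Matrix Multiplication: one check inside the second innermost of three loops.
mm_small = dynamic_checks((512, 512, 512), hoist_depth=2)
mm_large = dynamic_checks((1024, 1024, 1024), hoist_depth=2)

# Euler Fluxes: checks sit inside the outermost loop (100 iterations).
euler_without_test = dynamic_checks((100,), 1, checks_per_visit=10)
euler_first_sample = dynamic_checks((100,), 1, checks_per_visit=10 - 6)
euler_other_samples = dynamic_checks((100,), 1, checks_per_visit=10 - 8)

print(adi, mm_small, mm_large)
print(euler_without_test, euler_first_sample, euler_other_samples)
```

Under these assumptions, each ADI check runs millions of times, while the Euler
Fluxes checks run at most a thousand times in total. This is consistent with
the small run-time effect reported for that benchmark.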



Table 2. Compilation Times in Seconds

                                                       pdm V 2.0                   mpxlf V 5.01                pghpf    xlhpf
Benchmark        Processor    Distribution             Optimized   Non-optimized   Optimized   Non-optimized   V 2.4    V 1.03
                 Array Size
--------------------------------------------------------------------------------------------------------------------------------
ADI              1 x 2        (BLOCK, BLOCK, *)           6.00         9.00           2.00         7.00         7.00     5.00
                 1 x 4        (BLOCK, BLOCK, *)           7.00         8.00           2.00         7.00         8.00     5.00
                 1 x 8        (BLOCK, CYCLIC(120), *)     7.00        11.00           2.00        17.00         8.00     [a]
Euler Fluxes     1 x 2        (BLOCK, BLOCK, *)          11.00        12.00           6.00         8.00         7.00     6.00
                 2 x 2        (BLOCK, *, CYCLIC)         10.00     [illegible]     [illegible]    11.00         9.00    12.00
                 8 x 1        (*, BLOCK, CYCLIC)         11.00        12.00           6.00        22.00         9.00     5.00
Matrix           2 x 1        (BLOCK, BLOCK) [b]          4.00         5.00           2.00         2.00         5.00     1.00
Multiplication   2 x 2        (BLOCK, BLOCK) [c]       [illegible]     5.00           2.00         3.00         6.00     1.00
                 4 x 2        (BLOCK, CYCLIC(120)) [c]    5.00         5.00           2.00         3.00         6.00     [a]




Given a sequential input source written in Fortran 77 with HPF directives,
PARADIGM produces SPMD output consisting of Fortran 77 statements and procedure
calls to the MPI library. The compilation of this SPMD code into the final
executable is then performed by a back-end compiler; in our setup, this was
mpxlf. Since the mapping test eliminates the generation of communication code
where possible, it also influences the overall compilation times: applying the
mapping test often yields a smaller intermediate SPMD code, which shortens the
back-end source-to-executable compilation. Note that applying the mapping test
does not necessarily lengthen the source-to-source compilation phase performed
by PARADIGM. Although compilation with the mapping test involves the additional
effort of identifying the candidate array reference pairs that are identically
mapped, it saves the communication code generation that would otherwise have
to be done for those same pairs. Hence, the source-to-source compilation phase
may in fact take longer in the absence of the mapping test, and this was found
to be true for nearly all of the benchmark samples tested. Moreover, as Table 2
also reveals, the back-end compilation times were nearly always longer in the
absence of the mapping test, because of the larger intermediate SPMD codes
handled.
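
To make the two-phase cost concrete, the sketch below adds the source-to-source
(pdm) and back-end (mpxlf) times from Table 2 for the optimized and
non-optimized builds. It assumes the column order pdm optimized, pdm
non-optimized, mpxlf optimized, mpxlf non-optimized. Rows containing an
illegible entry are left out, and the dictionary and function names are
illustrative only.

```python
# (pdm optimized, pdm non-optimized, mpxlf optimized, mpxlf non-optimized),
# in seconds, copied from the legible rows of Table 2.
COMPILE_TIMES = {
    "ADI 1x2": (6.0, 9.0, 2.0, 7.0),
    "ADI 1x4": (7.0, 8.0, 2.0, 7.0),
    "ADI 1x8": (7.0, 11.0, 2.0, 17.0),
    "Euler Fluxes 1x2": (11.0, 12.0, 6.0, 8.0),
    "Euler Fluxes 8x1": (11.0, 12.0, 6.0, 22.0),
    "Matrix Multiplication 2x1": (4.0, 5.0, 2.0, 2.0),
    "Matrix Multiplication 4x2": (5.0, 5.0, 2.0, 3.0),
}


def total_compile_times(row):
    """Return (optimized, non-optimized) end-to-end compilation times."""
    pdm_opt, pdm_non, backend_opt, backend_non = row
    return pdm_opt + backend_opt, pdm_non + backend_non


for sample, row in COMPILE_TIMES.items():
    optimized, non_optimized = total_compile_times(row)
    print(f"{sample}: {optimized:.2f} s with the mapping test, "
          f"{non_optimized:.2f} s without")
```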



[a] xlhpf does not permit a CYCLIC blocking factor greater than 1.
[b] Arrays were of type REAL; array sizes were 512 x 512.
[c] Arrays were of type REAL; array sizes were 1024 x 1024.

8 Summary

The preceding sections have shown certain basic and interesting properties that 
ownership sets exhibit, even in the presence of arbitrary alignments and distri- 
butions. Our approach to solving for the ownership set and other sets derived 
from it is based on integer FME solutions to the systems characterizing these 
sets. We have also shown how the system of equalities and inequalities
originally proposed in [?] can be refined to a form requiring the course vector
as the only auxiliary vector. This refinement is beneficial to the FME approach.
The fundamental property of ownership set equivalence is derived and we
demonstrate how
it can be used to eliminate redundant communication due to array replication. 
We also briefly describe how to efficiently make decisions regarding the “empti- 
ness” of an ownership set. Finally, we derive a sufficient condition which when 
true ensures that a right-hand side array reference of an assignment statement 
is available on the same processor that owns the left-hand side array reference, 
thus making it possible to avoid generating communication code for the pair. 
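
The question the mapping test answers can be illustrated by brute force. The
sketch below is not the compile-time sufficient condition derived in this paper,
and it does not use FME. It enumerates a one-dimensional iteration range and
compares owners, assuming the usual HPF definitions of BLOCK (block size
ceil(N/P)) and CYCLIC(k) on a 1-based template. All function names are
hypothetical.

```python
from functools import partial


def owner_block(cell, extent, procs):
    """Processor owning 1-based template cell `cell` under BLOCK."""
    block = -(-extent // procs)  # ceil(extent / procs)
    return (cell - 1) // block


def owner_cyclic(cell, k, procs):
    """Processor owning 1-based template cell `cell` under CYCLIC(k)."""
    return ((cell - 1) // k) % procs


def identically_mapped(lhs_cell, rhs_cell, owner, iterations):
    """True if both references land on the same processor for every iteration.

    `lhs_cell` and `rhs_cell` map a loop index to the template cell that the
    left-hand side and right-hand side references are aligned with.
    """
    return all(owner(lhs_cell(i)) == owner(rhs_cell(i)) for i in iterations)


cyclic120 = partial(owner_cyclic, k=120, procs=8)

# A(i) = B(i): both references are aligned with template cell i.
same = identically_mapped(lambda i: i, lambda i: i, cyclic120, range(1, 1025))

# A(i) = B(i + 1): the references straddle a block boundary at i = 120.
shifted = identically_mapped(lambda i: i, lambda i: i + 1, cyclic120,
                             range(1, 1024))

print(same, shifted)
```

When the answer is true for every iteration, no communication code is needed
for the pair. The contribution of the paper is to establish this at compile time
without enumerating the iteration space.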

The mapping test is a very useful optimization. Its positive effect was
observable in the case of other benchmarks such as Jacobi, TOMCATV and 2-D
Explicit Hydrodynamics (Livermore Kernel 18), and was significant in
most situations. This is because, typically, suitably chosen
ALIGN and DISTRIBUTE directives perfectly align and distribute at least one pair
of left-hand side and right-hand side array references in at least one assignment
statement of the program, and such alignments and distributions are often valid
whatever values the loop iteration vector ranges through.

Thus, by exploiting the ownership set, efficient SPMD code can be
generated at compile time.

Abstract. Optimizing a parallel program is often difficult. This is true,
in particular, for inexperienced programmers who lack the knowledge and 
intuition of advanced parallel programmers. We have developed a frame- 
work that addresses this problem by automating the analysis of static 
program information and performance data, and offering active sugges- 
tions to programmers. Our tool enables experts to transfer programming 
experience to new users. It complements today’s parallelizing compilers 
in that it helps to tune the performance of a compiler-optimized paral- 
lel program. To show its applicability, we present two case studies that 
utilize this system. By simply following the suggestions of our system, 
we were able to reduce the execution time of benchmark programs by as 
much as 39%. 



1 Introduction 

Parallelization, performance data analysis and program tuning are very difficult 
tasks for inexperienced parallel programmers. Tools such as parallelizing compil- 
ers and visualization systems help facilitate this process. Today’s state-of-the-art 
parallelization and visualization tools provide efficient automatic utilities and 
ample choices for viewing and monitoring the behavior of parallel applications. 

Nevertheless, tasks such as identifying performance problems and finding the 
right solutions have remained cumbersome to many programmers. Meaningful 
interpretation of a large amount of performance data is challenging and takes 
significant time and effort. Once performance bottlenecks are found through 
analysis, programmers need to study code regions and devise remedies to address 
the problems. Programmers generally rely on their knowhow and intuition to 
accomplish these tasks. Experienced programmers have developed a sense of 
“what to look for” in the given data in the presence of performance problems. 
Tuning programs requires dealing with numerous individual instances of code 
segments. Categorizing these variants and finding the right remedies also demand 
sufficient experience from programmers. For inexperienced programmers there
are few choices other than empirically acquiring knowledge through trial and
error. Even learning parallel programming skills from experienced peers takes
time and effort. No tools exist that help transfer knowledge from experienced
to inexperienced programmers.

and #9975275-EIA. This work is not necessarily representative of the positions or
policies of the U.S. Government.

We believe that tools can be of considerable use in addressing these problems. 
We have developed a framework for an automatic performance advisor, called 
Merlin, that allows performance evaluation experience to be shared with others. 
It analyzes program and performance characterization data and presents users 
with interpretations and suggestions for performance improvement. Merlin is 
based on a database utility and an expression evaluator implemented previously
[?]. With Merlin, experienced programmers can guide inexperienced programmers
in handling individual analysis and tuning scenarios. The behavior of Merlin
is controlled by a knowledge-based database called a performance map. Using
this framework, we have implemented several performance maps reflecting our 
experiences with parallel programs. 

The contribution of this paper is to provide a mechanism and a tool that 
can assist programmers in parallel program performance tuning. Related work
is presented in Section 2. Section 3 describes Merlin in detail. In Sections 4
and 5 we present two case studies that utilize the tool. First, we present a
simple technique to improve performance using an application of Merlin for
the automatic analysis of timing and static program analysis data. Next, we
apply this system to the more advanced performance analysis of data gathered
from hardware counters. Section 6 concludes the paper.



2 Related Work 



Tools provide support for many steps in a parallelization and performance tuning
scenario. Among the supporting tools are those that perform automatic
parallelization, performance visualization, instrumentation, and debugging. Many
of the current tools are summarized in [?]. Performance visualization has been
the subject of many previous efforts [?], providing a wide variety of
perspectives on many aspects of the program behavior. The natural next step
in supporting the performance evaluation process is to automatically analyze
the data and actively advise programmers. Providing such support has been
attempted by only a few researchers.

The terms “performance guidance” or “performance advisor” are used in 
many different contexts. Here, we use them to refer to taking a more active role 
in helping programmers overcome the obstacles in optimizing programs through 
an automated guidance system. In this section, we discuss several tools that 
support this functionality. 

The SUIF Explorer’s Parallelization Guru bases its analysis on two metrics:
parallelism coverage and parallelism granularity [?]. These metrics are
computed and updated when programmers make changes to a program and run
it. It sorts profile data in decreasing order to draw programmers’ attention
to the most time-consuming sections of the program. It is also capable of
analyzing data dependence information and highlighting the sections that need
to be examined by the programmers.

The Paradyn Performance Consultant [?] discovers performance problems
by searching through the space defined by its own search model (named the W3
space). The search process is fully automatic, but manual refinements to direct
the search are possible as well. The result is presented to the users through a 
graphical display. 

PPA [?] proposes a different approach in tuning message passing programs.
Unlike the Parallelization Guru and the Performance Consultant, which base 
their analysis on runtime data and traces, PPA analyzes program source and 
uses a deductive framework to derive the algorithm concept from the program 
structure. Compared to other programming tools, the suggestions provided by 
PPA are more detailed and assertive. The solution, for example, may provide an 
alternative algorithm for the code section under inspection. 

The Parallelization Guru and the Performance Consultant basically tell the 
user where the problem is, whereas the expert system in PPA takes the role of 
a programming tool a step further toward an active guidance system. However, 
the knowledge base for PPA’s expert system relies on an understanding of the 
underlying algorithm based on pattern matching, and the tool works only for a 
narrow set of algorithms. 

Our approach is different from the others, in that it is based on a flexible 
system controlled by a performance map, which any experienced programmer 
can write. An experienced user states relationships between common perfor- 
mance problems, characterization data signatures that may indicate sources of 
the problem, and possible solutions related to these signatures. The performance
map may contain complex calculations and evaluations and can therefore act
as a performance advisor, an analyzer, or both. In order to
select appropriate data items and reason about them, a pattern matching module
and an expression evaluation utility are provided by the system. Experienced
programmers can use this system to help inexperienced programmers in many 
different aspects of parallel programming. In this way, the tool facilitates an 
efficient transfer of knowledge to less experienced programmers. For example, if 
a programmer encounters a loop that does not perform well, they may activate 
a performance advisor to see the expert’s suggestions on such phenomena. Our 
system does not stop at pointing to problematic code segments. It presents users 
with possible causes and solutions. 

3 Merlin: Performance Advisor 

Merlin is a graphical user interface utility that allows users to perform auto- 
mated analysis of program characterization and performance data. This data can 
include dynamic information such as loop timing statistics and hardware
performance statistics. It can also include compiler-generated data such as
control flow graphs and listings of statically applied techniques using the
Polaris parallelizing compiler [?]. It can be invoked from the Ursa Minor
performance evaluation tool as shown in Figure 1. Activating Merlin is as
simple as clicking a mouse on a problematic program section from this tool.




Fig. 1. Interface between related tools. The Merlin performance advisor plays a 
key role in the URSA Minor tool. It uses static and dynamic information collected by 
the Ursa Minor tool, and suggests possible solutions to performance problems. 



Through compilation, simulation, and execution, a user gathers various types
of data about a target program. Upon activation, Merlin performs various
analysis techniques on the data at hand and presents its conclusions to the
user. Figure 2 shows an instance of the Merlin interface. It consists of an
analysis text area, an advice text area, and buttons. The analysis text area
displays the diagnostics Merlin has performed on the selected program unit.
The advice text area provides Merlin’s solution to the detected problems with
examples. Each diagnosis and its corresponding advice are paired by a number
(such as Analysis 1-2, Solution 1-2).

Fig. 2. The user interface of Merlin in use. Merlin suggests possible solutions
to the detected problems. This example shows the problems addressed in loop ACTFOR
DO500 of program BDNA. The button labeled Ask Merlin activates the analysis. The
View Source button opens the source viewer for the selected code section. The ReadMe
for Map button pulls up the ReadMe text provided by the performance map writer.

Merlin navigates through a database that contains knowledge on diagnosis
and solutions for cases where certain performance goals are not achieved.
Experienced programmers write performance maps based on their knowledge, and
inexperienced programmers can view their suggestions by activating Merlin.
Figure 3 shows the structure of a typical map used by this framework. It
consists of three “domains.” The elements in the Problem Domain correspond to
general performance problems from the viewpoint of programmers. They can
represent a poor speedup, a large number of stalls, etc., depending upon the
performance data types targeted by the performance map writer. The Diagnostics
Domain depicts possible causes of these problems, such as floating point
dependencies, data cache overflows, etc. Finally, the Solution Domain contains
possible remedies. These elements are linked by Conditions. Conditions are
logical expressions representing an analysis of the data. If a condition
evaluates to true, the corresponding link is taken, and the element in the next
domain pointed to by the link is explored. Merlin invokes an expression
evaluator utility for the evaluation of these expressions. When it reaches the
Solution Domain, the suggestions given by the map writer are displayed and
Merlin moves on to the next element in the Problem Domain.



[Figure 3 panels: Problem Domain, Diagnostics Domain, Solution Domain]


Fig. 3. The internal structure of a typical Merlin performance map. The Prob- 
lem Domain corresponds to general performance problems. The Diagnostics Domain 
depicts possible causes of the problems, and the Solution Domain contains suggested 
remedies. Conditions are logical expressions representing an analysis of the data. 



For example, in the default map of the Ursa Minor tool, one element in
the Problem Domain is entitled “poor speedup.” The condition for this
element is “the loop is parallel and the parallel efficiency is less than 0.7.”
The link for this condition leads to an element in the Diagnostics Domain
labeled “poor speedup symptoms” with conditions that evaluate the
parallelization and spreading overheads. When these values are too high, the
corresponding links from these conditions point to suggestions for program
tuning steps, such as serialization, fusion, interchange, and padding. The data
items needed to compute these expressions are fetched from Ursa Minor’s internal
database using the pattern matching utility. If needed data are missing (e.g.,
because the user has not yet generated hardware counter profiles), Merlin
displays a message and continues with the next element. The performance map is
written in Ursa Minor’s generic input text format [?]. It is structured text of
data descriptions that can be easily edited, so map writers can use any text
editor. Merlin reads this file and stores it internally. When a user chooses a
loop for automatic analysis, Merlin begins by analyzing the conditions in the
Problem Domain.
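
The traversal just described can be sketched in a few lines. This is an
illustration of the three-domain structure only, not Merlin's implementation or
map file format. The class and function names, the 0.3 overhead threshold, and
the pairing of each overhead with particular remedies are assumptions; only the
“poor speedup” condition (a parallel loop with efficiency below 0.7) comes
from the text.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Data = Dict[str, float]


@dataclass
class Element:
    label: str
    # Outgoing links: (condition, element in the next domain).
    links: List[Tuple[Callable[[Data], bool], "Element"]] = field(
        default_factory=list)


def walk(element: Element, data: Data, report: List[str]) -> None:
    """Follow every link whose condition holds; report Solution elements."""
    if not element.links:  # Solution Domain: no outgoing links
        report.append(element.label)
        return
    for condition, target in element.links:
        try:
            taken = condition(data)
        except KeyError as missing:  # a needed data item was not collected
            report.append(f"missing data item: {missing.args[0]}")
            continue
        if taken:
            walk(target, data, report)


# Solution Domain (the grouping of remedies is hypothetical).
serialize_or_fuse = Element("consider serialization or loop fusion")
interchange_or_pad = Element("consider loop interchange or padding")

# Diagnostics Domain.
symptoms = Element("poor speedup symptoms", [
    (lambda d: d["parallelization_overhead"] > 0.3, serialize_or_fuse),
    (lambda d: d["spreading_overhead"] > 0.3, interchange_or_pad),
])

# Problem Domain.
poor_speedup = Element("poor speedup", [
    (lambda d: d["is_parallel"] and d["parallel_efficiency"] < 0.7, symptoms),
])
PROBLEM_DOMAIN = [poor_speedup]


def advise(data: Data) -> List[str]:
    report: List[str] = []
    for problem in PROBLEM_DOMAIN:
        walk(problem, data, report)
    return report


print(advise({"is_parallel": True, "parallel_efficiency": 0.4,
              "parallelization_overhead": 0.5, "spreading_overhead": 0.1}))
```

Because each link is evaluated independently, a loop that satisfies several
conditions receives every matching suggestion, and a missing data item produces
a message without stopping the traversal.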

Merlin differs from conventional spreadsheet macros in that it is capable 
of comprehending static analysis data generated by a parallelizing compiler. 
Merlin can take into account numeric performance data as well as program
characterization data, such as parallel loops detected by the compiler, the
existence of I/O statements, or the presence of function calls within a code block.
This allows a comprehensive analysis based on both performance and static data 
available for the code section under consideration. 

A Merlin map enables efficient cause-effect analyses of performance and 
static data. It fetches the data specified by the map from the URSA Minor tool, 
performs the listed operations and follows the links if the conditions are true. 
There are no restrictions on the number of elements and conditions within each 
domain, and each link is followed independently. Hence, multiple perspectives 
can be easily incorporated into one map. For instance, stalls may be caused
by poor locality, but they could also indicate a floating point dependence in
the CPU pipeline. In this way, Merlin considers all possible causes for
performance problems separately and presents an inclusive set of solutions to
its users. At
the same time, the remedies suggested by Merlin assist users in “learning by 
example.” Merlin enables users to gain expertise in an efficient manner by 
listing performance data analysis steps and many example solutions given by 
experienced programmers. 

Merlin is able to work with any performance map as long as it is in the 
proper format. Therefore, the intended focus of performance evaluation may shift 
depending on the interest of the user group. For instance, the default map that 
comes with Merlin focuses on a performance model based on parallelization and 
spreading overhead. Should a map that focuses on architecture be developed and 
used instead, the response of Merlin will reflect that intention. Furthermore, 
the Ursa Minor environment does not limit its usage to parallel programming. 
Coupled with Merlin, it can be used to address many topics in optimization 
processes of various engineering practices. 

Merlin is accessed through the Ursa Minor performance evaluation
tool [?]. The main goal of Ursa Minor is to optimize program performance
through the interactive integration of performance evaluation with static pro- 
gram analysis information. It collects and combines information from various 
sources, and its graphical interface provides selective views and combinations of 
the gathered data. Ursa Minor consists of a database utility, a visualization 
system for both performance data and program structure, a source searching and 
viewing tool, and a file management module. URSA Minor also provides users 
with powerful utilities for manipulating and restructuring the input data to serve 
as the basis for the users’ deductive reasoning. URSA Minor can present to the 
user and reason about many different types of data (e.g., compilation results, 
timing profiles, hardware counter information), making it widely applicable to 
different kinds of program optimization scenarios. The ability to invoke Merlin
greatly enhances the functionality of Ursa Minor.

4 Case Study 1: Simple Techniques to Improve Performance

In this section, we present a performance map based solely on execution timings 
and static compiler information. Such a map requires program characterization 
data that an inexperienced user can easily obtain. The map is designed to advise 
programmers in improving the performance of programs achieved by a
parallelizing compiler such as Polaris [?]. Parallelizing compilers significantly simplify the
task of parallel optimization, but they lack knowledge of the dynamic program 
behavior and have limited analysis capabilities. These limitations may lead to 
marginal performance gains. Therefore, good performance from a parallel appli- 
cation is often achieved by a substantial amount of manual tuning. In this case 
study, we assume that programmers have used a parallelizing compiler as the 
first step in optimizing the performance of the target program. We also assume 
that the compiler’s program analysis results are available. The performance map 
presented in this section aims at increasing this initial performance. 

4.1 Performance Map Description 

Based on our experiences with parallel programs, we have chosen three
programming techniques that (1) are easy to apply and (2) may yield considerable
performance gain. These techniques are serialization, loop interchange, and loop
fusion. They are applicable to loops, which are often the focus of the shared
memory programming model. All of these techniques are present in modern
compilers. However, compilers may not have enough knowledge to apply them
most profitably, and some code sections may need small modifications before
the techniques become applicable automatically.

We have devised criteria for the application of these techniques, which are
shown in Table 1. If the speedup of a parallel loop is less than 1, we assume that
the loop is too small for parallelization or that it incurred excessive
transformations. Serializing it prevents performance degradation. Loop
interchange may be used to improve locality by increasing the number of stride-1
accesses in a loop nest. Loop interchange is commonly applied by optimizers;
however, our case study shows many examples of opportunities missed by the
backend compiler. Loop fusion can likewise be used to increase both granularity
and locality. The criteria shown in Table 1 represent simple heuristics and do
not attempt to be an exact analysis of the benefits of each technique. We simply
chose a speedup threshold of 2.5 to apply loop fusion.
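
Stated as code, the criteria amount to three independent tests per loop. The
sketch below is only an illustration: the function name and argument names are
ours, and it reports every criterion that fires without checking whether a
transformation is legal for the loop in question.

```python
def suggest_techniques(speedup, stride1_accesses, non_stride1_accesses):
    """Return the techniques whose Table 1 criterion holds for one loop."""
    suggestions = []
    if speedup < 1:
        suggestions.append("serialization")
    if stride1_accesses < non_stride1_accesses:
        suggestions.append("loop interchange")
    if speedup < 2.5:
        suggestions.append("loop fusion")
    return suggestions


# A loop that slows down when run in parallel and has mostly stride-1 accesses.
print(suggest_techniques(0.8, stride1_accesses=3, non_stride1_accesses=1))
# A well-scaling loop dominated by non-stride-1 accesses.
print(suggest_techniques(3.0, stride1_accesses=1, non_stride1_accesses=4))
```

Note that a loop with a speedup below 1 also satisfies the fusion criterion, so
the order in which the techniques are tried still matters.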

4.2 Experiment 

We have applied the techniques shown in Table 1 based on the described criteria.
We performed our measurements on a Sun Enterprise 4000 with six 248 MHz
UltraSPARC V9 processors, each with a 16 KB L1 data cache and a 1 MB unified
L2 cache. Each code variant was compiled by the Sun v5.0 Fortran 77 compiler
with the flags -xtarget=ultra2 -xcache=16/32/1:1024/64/1 -O5.

Table 1. Optimization technique application criteria.

Technique          Criteria
-----------------  ---------------------------------------------------
Serialization      speedup < 1
Loop Interchange   # of stride-1 accesses < # of non-stride-1 accesses
Loop Fusion        speedup < 2.5


The OpenMP code is generated by the Polaris OpenMP backend. The results
of five programs are shown. They are SWIM and HYDRO2D from the SPEC95
benchmark suite, SWIM from SPEC2000, and ARC2D and MDG from the Perfect
Benchmarks. We have incrementally applied these techniques starting with
serialization. Figure 4 shows the speedup achieved by the techniques. The
improvement in execution time ranges from -1.8% for fusion in ARC2D to 38.7% for
loop interchange in SWIM from SPEC2000. For HYDRO2D, application of the Merlin
suggestions did not noticeably improve performance.




[Figure 4 legend: Original OpenMP, Serialization, Interchange, Fusion]

Fig. 4. Speedup achieved by applying the performance map of Table 1. Detailed
numbers can be seen in Table 2. The speedup is with respect to a one-processor
run with serial code on a Sun Enterprise 4000 system. Each graph shows the
cumulative speedup when applying each technique. The original OpenMP program
was generated by the Polaris parallelizer.



Among the codes with large improvement, SWIM from SPEC2000 benefits
most from loop interchange. It was applied under the suggestion of Merlin to
the most time-consuming loop, SHALOW DO3500. Likewise, the main technique
that improved the performance in ARC2D was loop interchange. MDG consists of
two large loops and numerous small loops. Serializing these small loops was the
sole reason for the performance gain. Table 2 shows a detailed breakdown of how
often techniques were applied and their corresponding benefit.



Table 2. A detailed breakdown of the performance improvement due to each
technique.

Benchmark   Technique       Number of Modifications   % Improvement
---------   -------------   -----------------------   -------------
ARC2D       Serialization              3                   -1.55
            Interchange               14                    9.77
            Fusion                    10                   -1.79
HYDRO2D     Serialization             18                   -0.65
            Interchange                0                    0.00
            Fusion                     2                    0.97
MDG         Serialization             11                   22.97
            Interchange                0                    0.00
            Fusion                     0                    0.00
SWIM ’95    Serialization              1                    0.92
            Interchange                0                    0.00
            Fusion                     3                    2.03
SWIM ’00    Serialization              0                    0.00
            Interchange                1                   38.69
            Fusion                     1                    0.03



Using this map, considerable speedups are achieved with relatively small 
effort. Inexperienced programmers can simply run Merlin to see the suggestions 
made by the map. The map can be updated flexibly without modifying Merlin. 
Thus if new techniques show potential or the criteria need revision, programmers 
can easily incorporate changes. 



5 Case Study 2: Hardware Counter Data Analysis 

In our second case study, we discuss a more advanced performance map that uses
the speedup component model introduced in [?]. The model fully accounts for the
gap between the measured speedup and the ideal speedup in each parallel
program section. This model assumes execution on a shared-memory multiprocessor
and requires that each parallel section be fully characterized using hardware
performance monitors to gather detailed processor statistics. Hardware monitors
are now available on most commodity processors.

With hardware counter and timer data loaded into Ursa Minor, users can
simply click on a loop from the Ursa Minor table view and activate Merlin.
Merlin then lists the numbers corresponding to the various overhead
components responsible for the speedup loss in each code section. The displayed
values for the components show overhead categories in a form that allows users
to easily see why a parallel region does not exhibit the ideal speedup of p on p
processors. Merlin then identifies the dominant components in the loops under
inspection and suggests techniques that may reduce these overheads. An overview
of the speedup component model and its implementation as a Merlin map are given
below.
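
The bookkeeping behind this display can be sketched as follows. The sketch
assumes that each overhead component is expressed in units of lost speedup, so
that the components sum to the gap between p and the measured speedup. The
function name and all numeric values are hypothetical and do not come from our
measurements.

```python
def speedup_gap(processors, measured_speedup, components):
    """Return (gap to ideal speedup, dominant component, unexplained rest)."""
    gap = processors - measured_speedup
    dominant = max(components, key=components.get)
    unexplained = gap - sum(components.values())
    return gap, dominant, unexplained


# Hypothetical loop on p = 6 processors with a measured speedup of 3.5.
components = {
    "memory stalls": 1.5,
    "processor stalls": 0.25,
    "code overhead": 0.25,
    "thread management": 0.5,
}
gap, dominant, unexplained = speedup_gap(6, 3.5, components)
print(gap, dominant, unexplained)
```

A model that fully accounts for the gap leaves nothing unexplained, and the
dominant component indicates where tuning effort is best spent.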



5.1 Performance Map Description 

The objective of our performance map is to be able to fully account for the
performance losses incurred by each parallel program section on a shared-memory
multiprocessor system. We categorize overhead factors into four main
components. Table 3 shows the categories and their contributing factors.



Table 3. Overhead categories of the speedup component model.

Overhead Category   Contributing Factors   Description                                                                            Measured with
-----------------   --------------------   ------------------------------------------------------------------------------------   -------------
Memory stalls       IC miss                Stall due to I-Cache miss.                                                             HW Cntr
                    Write stall            The store buffer cannot hold additional stores.                                        HW Cntr
                    Read stall             An instruction in the execute stage depends on an earlier load that is not yet         HW Cntr
                                           completed.
                    RAW load stall         A read needs to wait for a previously issued write to the same address.                HW Cntr
Processor stalls    Mispred. stall         Stall caused by branch misprediction and recovery.                                     HW Cntr
                    Float Dep. stall       An instruction needs to wait for the result of a floating point operation.             HW Cntr
Code overhead       Parallelization        Added code necessary for generating parallel code.                                     computed
                    Code generation        More conservative compiler optimizations for parallel code.                            computed
Thread management   Fork&join              Latencies due to creating and terminating parallel sections.                           timers
                    Load imbalance         Wait time at join points due to uneven workload distribution.                          timers



Memory stalls reflect latencies incurred due to cache misses, memory access
times and network congestion. Merlin will calculate the cycles lost due to these
overheads. If the percentage of time lost is large, locality-enhancing software
techniques will be suggested. These techniques include optimizations such as
loop interchange, loop tiling, and loop unrolling. We found in [?] that loop
interchange and loop unrolling are among the most important techniques.

Processor stalls account for delays incurred processor-internally. These
include branch mispredictions and floating point dependence stalls. Although it
is difficult to address these stalls directly at the source level, loop unrolling
and loop fusion, if properly applied, can remove branches and give more freedom
to the backend compiler to schedule instructions. Therefore, if processor stalls
are a dominant factor in a loop’s performance, Merlin will suggest that these
two techniques be considered.

Code overhead corresponds to the time taken by instructions not found in 
the original serial code. A positive code overhead means that the total number 
of cycles, excluding stalls, that are consumed across all processors executing the 
parallel code is larger than the number used by a single processor executing the 
equivalent serial section. These added instructions may have been introduced 
when parallelizing the program (e.g., by substituting an induction variable) or 
through a more conservative parallel-code-generating compiler. If code overhead
causes performance to degrade below that of the original code, Merlin will
suggest serializing the code section.

Thread management accounts for latencies incurred at the fork and join 
points of each parallel section. It includes the times for creating or notifying 
waiting threads, for passing parameters to them, and for executing barrier op- 
erations. It also includes the idle times spent waiting at barriers, which are due 
to unbalanced thread workloads. We measure these latencies directly through 
timers before and after each fork and each join point. Thread management la- 
tencies can be reduced through highly-optimized runtime libraries and through 
improved balancing schemes of threads with uneven workloads. Merlin will 
suggest improved load balancing if this component is large. 

Ursa Minor combined with this Merlin map displays (1) the measured per- 
formance of the parallel code relative to the serial version, (2) the execution 
overheads of the serial code in terms of stall cycles reported by the hardware 
monitor, and (3) the speedup component model for the parallel code. We will 
discuss details of the analysis where necessary to explain effects. However, for 
the full analysis with detailed overhead factors and a larger set of programs we 
refer the reader to PI- 

5.2 Experiment 

For our experiment we translated the original source into OpenMP parallel form 
using the Polaris parallelizing compiler P). The source program is the Perfect 
Benchmark ARC2D, which is parallelized to a high degree by Polaris. We have 
used the same machine as in Section El For hardware performance measurements, 
we used the available hardware counter (TICK register) [Ej. 

ARC2D consists of many small loops, each of which has an average execution
time of a few milliseconds. Figure 5 shows the overheads in the loop STEPFX DO230
of the original code, and the speedup component graphs generated before and
after applying a loop interchange transformation.
[Figure 5 graphs. Left panel: Cycle Distribution of the serial code (vertical axis from 0.0E+00 to 2.5E+09 cycles). Right panel: Speedup Component Model for the No loop-interchange and Loop-interchange variants. Legend: Ideal, Measured Speedup, Memory Stalls, Code Overhead.]



Fig. 5. Performance analysis of the loop STEPFX DO230 in program ARC2D. The
graph on the left shows the overhead components in the original, serial code. The 
graphs on the right show the speedup component model for the parallel code variants 
on 4 processors before and after loop interchanging is applied. Each component of this 
model represents the change in the respective overhead category relative to the serial 
program. The sum of all components is equal to the ideal speedup (=4 in our case). 
Merlin is able to generate the information shown in these graphs. 



Merlin calculates the speedup component model using the data collected 
by a hardware counter, and displays the speedup component graph. Merlin 
applies the following map using the speedup component model: If the memory 
stall appears both in the performance graph of the serial code and in the speedup 
component model for the Polaris-parallelized code, then apply loop interchange. 
Following this suggested recipe, the user tries loop interchange, which results in
a significant, now superlinear, speedup. In Figure 5, the “loop-interchange” graph on
the right shows that the memory stall component has become negative, which means
that there are fewer stalls than in the original, serial program. The negative
component explains why there is superlinear speedup. The speedup component 
model further shows that the code overhead component has drastically decreased 
from the original parallelized program. The code is even more efficient than in 
the serial program, further contributing to the superlinear speedup. 
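
The arithmetic behind this observation can be sketched in a few lines. The sketch assumes only what the caption of Fig. 5 states, namely that the measured speedup and the overhead components sum to the ideal speedup. The function name and all numbers are hypothetical illustrations, not measurements from ARC2D.

```python
def measured_speedup(ideal, overhead_components):
    """Measured speedup implied by the speedup component model.

    The model states that the measured speedup plus all overhead components
    equals the ideal speedup, so the measured speedup is what remains after
    subtracting the overhead components from the ideal value.
    """
    return ideal - sum(overhead_components.values())


# Hypothetical component values on 4 processors (not taken from the paper).
before = {"memory stalls": 1.5, "code overhead": 0.5}
after = {"memory stalls": -0.5, "code overhead": -0.25}

print(measured_speedup(4, before))  # positive components: below the ideal of 4
print(measured_speedup(4, after))   # negative components: above the ideal of 4
```

A negative component means the parallel code loses fewer cycles in that category than the serial code did, which is the only way the remaining share can exceed the ideal speedup.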

In this example, the use of the performance map for the speedup compo- 
nent model has significantly reduced the time spent by a user analyzing the 
performance of the parallel program. It has helped explain both the sources of 
overheads and the sources of superlinear speedup behavior. 

6 Conclusions 

We have presented a framework and a tool, Merlin, that address an important
open issue in parallel programming: guiding inexperienced programmers in the
process of tuning parallel program performance. It is a utility with a graphical
user interface that allows programmers to examine suggestions made by expert 
programmers on various performance issues. While equipped with a powerful 
expression evaluator and pattern matching utility, Merlin’s analysis and ad- 
vice are entirely guided by a performance map. Any experienced programmer 
can write a performance map to help new programmers in performance tuning. 
Merlin is accessed through a performance evaluation tool, so performance vi- 
sualization and data gathering are done in conjunction with this performance 
advisor. 

We have presented two case studies that utilize Merlin. In the first study, 
we have introduced a simplified performance map that can still effectively guide 
users in improving the performance of real applications. In the second study,
Merlin is used to compute various overhead components for investigating
performance problems. The results show that the relatively small effort to run
Merlin can lead to significant speedup gains and insight into the performance
behavior of a program.

With the increasing number of new users of parallel machines, the lack of ex- 
perience and transfer of programming knowledge from advanced to novice users 
is becoming an important issue. A novel aspect of our system is that it allevi- 
ates these problems through automated analysis and interactive guidance. With 
advances in performance modeling and evaluation techniques as exemplified in 
this paper, parallel computing can be made amenable to an increasingly large 
community of users. 
Abstract. The internal mechanism used for a dependence test constrains
its accuracy and determines its speed. So does the form in which
it represents array subscript expressions. The internal mechanism and
representational form used for our Access Region Test (ART) are different
from those used in any other dependence test, and therefore its constraints
and characteristics are likewise different. In this paper, we briefly describe
our descriptor for representing memory accesses, the Linear Memory Ac- 
cess Descriptor (LMAD) and the ART. We then describe the LMAD 
intersection algorithm in some detail. Finally, we compare and contrast 
the mechanisms of the LMAD intersection algorithm with the internal 
mechanisms for a number of prior dependence tests. 



1 Introduction 

Dependence analysis has traditionally been cast as an equation-solving activ- 
ity. The dependence problem was posed originally as a problem in constrained 
Diophantine equations. That is, the subscript expressions for two array refer- 
ences within nested loops were equated, constraints derived from the program 
text were added, and an attempt was made to solve the resulting system for 
integer values of the loop indices. Solving this problem is equivalent to integer 
programming, which is known to be NP-complete Pld, so many heuristics have 
been proposed for the problem over the last 25 years. We refer to this regime as 
point-to-point dependence analysis. 

More recently, principally due to the introduction of the Omega Test m, 
which made Fourier Elimination practical, dependence testing has been cast in 
terms of the solution of a linear system PEiizn! A linear system can represent 
multiple memory locations as easily as it can a single memory location, so people 
began using linear systems to summarize the memory accesses in whole loops, 
then intersected the summaries to determine whether there was a dependence. 
We refer to this as summary-based dependence analysis. 

Linear systems also became the basis for interprocedural dependence test- 
ing 13 El EDI, through the translation of the linear systems from one procedural 

* This work was supported in part by the US Department of Energy through the 
University of California under Subcontract number B341494. 

S.P. Midkiff et al. (Eds.): LCPC 2000, LNCS 2017, pp. 289-023 2001. 
context to another. When array reshaping occurs across the procedure boundary,
this translation can lead to added complexity within the linear system. 

The memory locations accessed by a subscripted array reference have been 
represented to a dependence test by several forms. Triplet notation (begin : end
: stride) and various derivatives of this form have been used for representing
memory accesses 0 cni EH, as have sets of linear constraints. As indicated 
in inniHi, these forms are limited in the types of memory accesses that they can 
represent exactly. Sets of linear constraints H E3 0 are a richer representational 
mechanism, able to represent more memory reference patterns exactly than can 
triplet notation. Still, linear constraints cannot represent the access patterns of 
non-linear expressions. The Range Test is unique among dependence tests, in 
that it does not use a special representational form for its subscript expressions. 
It uses triplet notation for value ranges, but relies on the original subscript 
expressions themselves within its testing mechanism, so it loses no accuracy due 
to representational limitations. 

These techniques have been relatively successful within their domain of
applicability, but are limited by the restrictions that the representation and the
equation-solving activity itself impose. Almost all techniques are restricted to
subscript expressions that are linear with respect to the loop indices. Most tests 
require constant coefficients and loop bounds. However, as noted in HS| , com- 
mon situations produce non-linear subscript expressions, such as programmer- 
linearized arrays, the closed form for an induction variable, and aliasing between 
arrays with different numbers of dimensions. The Range Test PI is the only test 
that can do actual dependence testing with arbitrary non-linear subscript ex- 
pressions. 

We have developed a new representational form, the Linear Memory Access 
Descriptor (LMAD) |T7^ II Sj. which can precisely represent nearly all the mem- 
ory access patterns found in real Fortran programs. We have also developed an 
interprocedural, summary-based dependence testing framework, the Access
Region Test (ART), that uses this representation. At the core of the dependence
testing framework is an intersection algorithm that determines the memory lo- 
cations accessed in common between two LMADs. This is the equivalent of the 
equation-solving mechanisms at the core of traditional dependence tests. 

The LMAD intersection algorithm has restrictions similar to those of other 
dependence mechanisms. It cannot intersect LMADs produced by non-affine 
subscript expressions, or those produced within triangular loops. Its strengths
come from the facts that it is an exact test, that it can produce distance vectors,
and that it can be used precisely and interprocedurally. Simplification operators
have been defined for the LMAD, sometimes allowing non-affine LMADs to be
simplified to affine ones, compensating somewhat for the intersection algorithm’s
limitations. 

In previous papers, we focused on an in-depth description of the represen- 
tational form of the LMAD and the dependence testing framework based on 
the form. In this paper, we describe the intersection algorithm used in our de- 
pendence testing, and discuss how it compares with the internal mechanisms of 
other dependence testing techniques. In Section 2, we will briefly summarize
the representational form and the dependence analysis framework, and
describe the intersection algorithm in the framework. Then, in Sections 3 and 4,
we will consider other dependence analysis techniques and compare their core
mechanisms with ours. Finally, we will summarize our discussion in Section 5.



2 A Dependence Analysis Framework Using Access 
Descriptors 

In this section, we present an overview of the Access Region Test (ART). The
details of the ART can be found in our previous reports imiia.



2.1 Representing Array Accesses within Loops 

// A declared with m dimensions
for I_1 = 0 to U_1 {
    for I_2 = 0 to U_2 {
        ...
        for I_d = 0 to U_d {
            ... A(s_1(I), s_2(I), ..., s_m(I)) ...
        }
        ...
    }
}

Fig. 1. General form of a reference to array A in a d-nested loop. The notation
I represents the vector of loop indices: (I_1, I_2, ..., I_d).



Without loss of generality, in all that follows, we will assume that all loop 
indices are normalized. 

The memory space of a program is the set of memory locations which make up 
all the memory usable by a program. When an m-dimensional array is allocated 
in memory, it is linearized and usually laid out in either row-major or column- 
major order, depending on the language being used. In order to map the array 
space to the memory space of the program, the subscripting function must be 
mapped to a single integer that is the offset from the beginning of the array for 
the access. We define this subscript mapping $F_a$ for an array reference with a
subscripting function $s$, as in Figure 1, by

$$F_a(s_1, s_2, \ldots, s_m) = \sum_{k=1}^{m} s_k \cdot \lambda_k .$$

When the language allocates an array in column-major order, $\lambda_1 = 1$ and
$\lambda_k = \lambda_{k-1} \cdot n_{k-1}$ for $k \neq 1$. If the language allocates the array in row-major order,
$\lambda_m = 1$ and $\lambda_k = \lambda_{k+1} \cdot n_{k+1}$ for $k \neq m$. After applying $F_a$ to the subscripting
expressions $s_k$ in Figure 1, we would have the linearized form:

$$F_a(s(I)) = s_1(I)\lambda_1 + s_2(I)\lambda_2 + \cdots + s_m(I)\lambda_m . \qquad (1)$$
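
As an illustration of Equation 1, the following minimal sketch computes the multipliers and the linearized offset for both storage orders. It assumes zero-based subscripts and that the quantity written n_k above is the number of elements in dimension k; the function names and the example extents are hypothetical.

```python
def multipliers(extents, order="column"):
    """Return the per-dimension multipliers of a linearized array.

    extents[k] is the number of elements in dimension k (0-based here).
    """
    m = len(extents)
    lam = [1] * m
    if order == "column":
        # The first multiplier is 1; each later one scales by the previous extent.
        for k in range(1, m):
            lam[k] = lam[k - 1] * extents[k - 1]
    elif order == "row":
        # The last multiplier is 1; each earlier one scales by the next extent.
        for k in range(m - 2, -1, -1):
            lam[k] = lam[k + 1] * extents[k + 1]
    else:
        raise ValueError("order must be 'column' or 'row'")
    return lam


def linear_offset(subscripts, extents, order="column"):
    """Offset of one element from the start of the array (zero-based subscripts)."""
    lam = multipliers(extents, order)
    return sum(s * l for s, l in zip(subscripts, lam))


# Element (7, 4) of a hypothetical 100 x 100 array, in both storage orders.
print(linear_offset((7, 4), (100, 100), "column"))
print(linear_offset((7, 4), (100, 100), "row"))
```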

As the nested loop in Figure 1 executes, the subscript mapping function $F_a$
generates a sequence of offsets from the base address of A. We call this sequence 
the subscripting offset sequence: 

If the linearized form of the subscript expression $s$ can be written in a sum-
of-products form with respect to the individual loop indices,

$$F_a(s(I)) = f_0 + f_1(I_1) + f_2(I_2) + \cdots + f_d(I_d), \qquad (2)$$

then we can isolate the effect of each loop index on the subscripting offset
sequence. In this case, there is no restriction on the form of the functions $f_k$.
They can be subscripted-subscripts, or any non-affine function.

We define the isolated effect of any loop in a loop nest on a memory reference 
pattern to be a dimension of the access. A dimension $k$ can be characterized by
its stride, $\delta_k$, and the number of iterations in the loop, $U_k + 1$. An additional
expression, the span, is carried with each dimension since it is used in some
operations. The stride and span associated with a given loop index $I_k$ are defined
as

$$\delta_k = f_k[I_k \leftarrow I_k + 1] - f_k \qquad \text{(stride)} \qquad (3)$$

$$\sigma_k = f_k[I_k \leftarrow U_k] - f_k[I_k \leftarrow 0] \qquad \text{(span)} \qquad (4)$$

where the notation $f[i \leftarrow k]$ means to replace every occurrence of $i$ in expression
$f$ by $k$.
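
The two definitions can be tried out numerically. The sketch below is an illustration only: it models each per-loop function as a plain Python function of one index, evaluates the stride at a chosen index value instead of symbolically, and uses a hypothetical extent N. It also shows why a non-affine function has no single constant stride.

```python
def stride(f, i):
    """Stride of f at index value i: f[I <- I + 1] - f."""
    return f(i + 1) - f(i)


def span(f, upper):
    """Span of f over the index range 0..upper: f[I <- upper] - f[I <- 0]."""
    return f(upper) - f(0)


N = 100  # hypothetical array extent


def f_affine(i):
    # Contribution of a loop index that appears as 2*I in the second subscript.
    return 2 * N * i


def f_square(i):
    # A non-affine contribution, used only to contrast with the affine case.
    return i * i


print(stride(f_affine, 0), stride(f_affine, 7), span(f_affine, 9))
print(stride(f_square, 0), stride(f_square, 3), span(f_square, 9))
```

For the affine function the stride is the same at every index value, so one stride expression describes the whole dimension; for the quadratic function it changes from iteration to iteration.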

An LMAD is a representation of the subscripting offset sequence. It can be 
built for any array reference whose subscript expressions can be put in the form
of Equation 2, so all algorithms in this paper will assume this condition has
been met for all LMADs. It contains all the information necessary to generate
the subscripting offset sequence. Each loop index in the program is normalized 
internally for purposes of the representation, and called the dimension index. 

The LMAD contains:

— a starting value, called the base offset, represented as $\tau$, and

— for each dimension $k$:

• a dimension index $I_k$, taking on all integer values between 0 and $U_k$,

• a stride expression, $\delta_k$,

• a span expression, $\sigma_k$.

The span is useful for doing certain operations and simplifications on the
LMAD (for instance, detecting internal overlap, as will be described in
Section 2.3); however, it is only accurate when the dimension is monotonic and is
not required for accuracy of the representation. The general form for an LMAD
is written as

$$\mathcal{A}^{\delta_1, \delta_2, \ldots, \delta_d}_{\sigma_1, \sigma_2, \ldots, \sigma_d} + \tau .$$
      real A(0:N-1, 0:N-1)
      do I = 0, 9
        do J = 0, 2

T:        A(J+5, 2*I) = B(J, C(I))

Fig. 2. Example of references to array A in statement T within a 2-nested loop. 



Examples of the LMAD form for A and B in statement T of Figure 2 are as
follows:

$$\mathrm{LMAD}_T(A) = \mathcal{A}^{1,\, 2N}_{2,\, 18N} + 5 \qquad (5)$$

LMADt(B) = B + 5 (6) 

Note that an LMAD dimension whose span is zero adds no data elements to
an access pattern in the LMAD (we call this the zero-span property***). The
memory accesses represented by an LMAD are not changed when a dimension
with a span of 0 is added. Also, note that insertion of a zero-span dimension
cannot change any width of an LMAD, and likewise will not change the perfectly-
nested property. Therefore, it is always possible to add a dimension with any
desired stride to an LMAD, by making the associated span 0. So, given any 
two LMADs, with any dimensions, it is always possible to make them have 
dimensions with matching strides by adding zero-span dimensions to each. 
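
The zero-span property can be checked on a small example by enumerating the offsets an LMAD describes. This is a brute-force sketch for illustration, not part of the ART: it assumes integer strides and spans, with every span a non-negative multiple of its positive stride, and it uses a hypothetical extent N = 100 for the array of Figure 2.

```python
from itertools import product


def lmad_offsets(base, dims):
    """Set of offsets described by an LMAD.

    dims is a list of (stride, span) pairs; every span is assumed to be a
    non-negative multiple of its (positive) stride.
    """
    per_dim = [
        range(0, span + 1, stride) if span > 0 else range(1)
        for stride, span in dims
    ]
    return {base + sum(steps) for steps in product(*per_dim)}


N = 100  # hypothetical extent of each dimension of A

# Base offset 5, inner dimension (stride 1, span 2), outer (stride 2N, span 18N).
original = lmad_offsets(5, [(1, 2), (2 * N, 18 * N)])

# The same descriptor with an extra zero-span dimension of stride 7.
padded = lmad_offsets(5, [(1, 2), (7, 0), (2 * N, 18 * N)])

print(len(original), min(original), max(original))
print(original == padded)
```

Because a zero-span dimension contributes only the step 0, the padded descriptor enumerates exactly the same offsets as the original one.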

The following LMAD characteristics will be used later in the paper: 

Definition 1. Given an LMAD D, we call the sum of the spans of the first k 
dimensions the k-dimensional width of the LMAD, defined by 

$$\mathrm{width}_k(D) = \sum_{j=1}^{k} \sigma_j .$$

Definition 2. Given an LMAD D, we say it is perfectly nested if, for all k,

$$\delta_k > \mathrm{width}_{k-1}(D) .$$
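
Both definitions translate directly into code. In the sketch below, an LMAD's dimensions are modeled as a list of (stride, span) pairs, ordered from dimension 1 upward; this encoding and the function names are assumptions made for illustration.

```python
def width(dims, k):
    """Sum of the spans of the first k dimensions (width_0 is 0)."""
    return sum(span for _, span in dims[:k])


def is_perfectly_nested(dims):
    """True if every stride exceeds the width of all lower dimensions."""
    # enumerate() yields the 0-based position, which equals k - 1 for the
    # 1-based dimension number k used in Definition 2.
    return all(stride > width(dims, pos) for pos, (stride, _) in enumerate(dims))


# Inner stride 1 with span 2, outer stride 200 with span 1800: 200 > 2.
print(is_perfectly_nested([(1, 2), (200, 1800)]))

# Outer stride 2 does not exceed the inner width 2, so consecutive outer
# iterations touch a common offset.
print(is_perfectly_nested([(1, 2), (2, 18)]))
```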

2.2 Building Access Descriptors 

An LMAD representing the memory accesses in a nested loop, such as the one
in Figure 1, is built starting from the inner-most loop of a loop nest out to the
outer-most loop. The LMAD is first built to represent the scalar access made by
the statement itself. As an example of that, notice that statement T by itself, in
Figure 2, refers to a single memory location within A, namely A(J+5,2*I), using
the current values of I and J.

** This notation means that the loop indices take on values within Fa in normal loop- 
nest order. 

*** For a proof of this, see )Tnn7|
Then, the process moves outward in the loop nest, expanding the LMAD
successively by the loop index of each surrounding loop. The expansion operation
is detailed in Figure 3. The dimensions of the previous descriptor are copied
intact to the new descriptor. A single new dimension and a new base offset
expression are formed, using the original descriptor’s base offset expression as
input. For example, the statement T in Figure 2 initially produces the following
LMAD for array A:

$$\mathrm{LMAD}_T(A) = \mathcal{A} + I \cdot 2 \cdot N + J + 5$$

Expanding $\mathrm{LMAD}_T(A)$ for the J loop produces a stride expression of $[J + 1] - [J] = 1$
and a new span expression of $[2] - [0] = 2$. The new offset expression
becomes $I \cdot 2 \cdot N + 0 + 5 = I \cdot 2 \cdot N + 5$, making the expanded descriptor
$\mathrm{LMAD}_T = \mathcal{A}^{1}_{2} + (I \cdot 2 \cdot N) + 5$, which can be further expanded by the outer
loop index I to produce the LMAD shown in Equation 5. Function Expand is
accurate for any LMADs, regardless of their complexity.

Input: Original LMAD A^{δ_1,...,δ_{k-1}}_{σ_1,...,σ_{k-1}} + τ,
       loop index i_k with loop lower bound b_k, upper bound e_k, and stride s_k

Output: LMAD expanded by loop index i_k, and new dimension-index i'_k

Function Expand:

    Create dimension-index i'_k with 0 ≤ i'_k ≤ u_k, u_k = ⌊(e_k − b_k)/s_k⌋

    τ ← τ[i_k ← i'_k · s_k + b_k]

    δ_k ← τ[i'_k ← i'_k + 1] − τ;

    if (u_k is unknown) {
        σ_k ← ∞;
    } else {
        σ_k ← τ[i'_k ← u_k] − τ[i'_k ← 0];
    }

    τ_new ← τ[i'_k ← 0];

    Insert new dimension in LMAD

    return A^{δ_1,...,δ_{k-1},δ_k}_{σ_1,...,σ_{k-1},σ_k} + τ_new

end Function Expand

Fig. 3. A function for expanding an LMAD by a loop index. 
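
The expansion step can be sketched for the special case of an affine base offset with known loop bounds. This is a simplification for illustration, not the general Expand function: the base offset is encoded as a dictionary of integer coefficients keyed by loop-index name, with the constant under the key "const" (a hypothetical encoding), and N is given a hypothetical value.

```python
def expand(dims, base, index, lower, upper, step):
    """Expand an affine LMAD by one loop index.

    dims  : list of (stride, span) pairs built so far
    base  : dict mapping loop-index names to integer coefficients, with the
            constant term stored under the key "const"
    index : name of the loop index to expand by
    lower, upper, step : the loop's lower bound, upper bound and stride
    """
    coeff = base.get(index, 0)
    u = (upper - lower) // step              # largest value of the new index
    stride = coeff * step                    # base[i' <- i' + 1] - base
    span = coeff * step * u                  # base[i' <- u] - base[i' <- 0]
    new_base = {name: c for name, c in base.items() if name != index}
    new_base["const"] = new_base.get("const", 0) + coeff * lower  # base[i' <- 0]
    return dims + [(stride, span)], new_base


N = 100  # hypothetical extent of each dimension of A

# Scalar access A(J+5, 2*I): base offset I*2*N + J + 5, no dimensions yet.
dims, base = [], {"I": 2 * N, "J": 1, "const": 5}
dims, base = expand(dims, base, "J", 0, 2, 1)   # inner loop: J = 0, 2
dims, base = expand(dims, base, "I", 0, 9, 1)   # outer loop: I = 0, 9
print(dims, base)
```

With N = 100 the result has strides 1 and 200, spans 2 and 1800, and base offset 5, matching the step-by-step expansion described in the text.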



2.3 Intersecting LMADs 

The ART is used within the general framework of memory classification analysis. 
Within memory classification analysis, the entire program is traversed in execu- 
tion order, using abstract interpretation, with access summaries being computed 
for each level of each loop nest and stored in LMADs. Whenever loops are en- 
countered, the LMADs for the accesses are intersected to determine whether 
the loops are parallel. For further details of memory classification analysis, 
see IT2] . 
The intersection algorithm that is at the heart of the Access Region Test is
shown in Figure 4. The precision of the algorithm suffers whenever the less-than
relationship cannot be established between the various expressions involved in it. 
For example, for two strides to be sorted, less-than must be established between 
them. Establishing less-than depends on both the richness of the symbolic infor- 
mation derivable from the program and the symbolic manipulation capabilities 
of the compiler. 

So, provided that all necessary symbolic less-than relationships can be estab- 
lished, the intersection algorithm is an exact algorithm for descriptors meeting 
its input requirements. By “exact”, we mean that whenever there is an inter- 
section, it will return exactly the elements involved, and whenever there is no 
intersection, it will report that there is no intersection. The exactness of this 
algorithm makes possible an exact dependence test. 

The intersection algorithm operates on two LMADs which have the same 
number of sorted dimensions (d), and the same strides. As mentioned above, 
given any two LMADs, it is always possible to transform them into matching 
descriptors because of the zero-span property. In addition each descriptor must 
have the perfect nesting attribute. 

The dimensions on the two descriptors are sorted according to the stride, so
that $\delta_{k+1} > \delta_k$, for $1 \le k \le d - 1$. In addition to the dimensions 1 through $d$,
the algorithm refers to dimension 0, which means a scalar access at the location
represented by the base offset. Likewise, of the two descriptors, it is required 
that one be designated the left descriptor (the one with the lesser base offset), 
while the other is the right descriptor. 
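
The preparation step, giving both descriptors the same sorted strides by adding zero-span dimensions, can be sketched as follows. The sketch assumes concrete integer strides that are distinct within each descriptor, and models a descriptor's dimensions as a list of (stride, span) pairs; the function name is hypothetical.

```python
def match_dimensions(dims_a, dims_b):
    """Return both dimension lists padded to identical, ascending strides.

    A stride that is present in only one descriptor is added to the other
    with a span of 0, which leaves the described accesses unchanged.
    """
    strides = sorted({s for s, _ in dims_a} | {s for s, _ in dims_b})

    def pad(dims):
        spans = dict(dims)  # stride -> span for the dimensions that exist
        return [(s, spans.get(s, 0)) for s in strides]

    return pad(dims_a), pad(dims_b)


left, right = match_dimensions([(1, 2)], [(200, 1800)])
print(left)
print(right)
```

After this step the two descriptors can be walked dimension by dimension, since position k in one list always refers to the same stride as position k in the other.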

As shown in Figure 4, the algorithm treats the dimension with the largest
stride first. For each dimension, the algorithm first checks whether the extents 
of the two dimensions overlap. If they do, the algorithm then makes two new de- 
scriptors from the original descriptors by stripping off the outermost dimension, 
and adjusting the base offsets of the two descriptors appropriately. It then recur- 
sively calls the intersection algorithm, targeting the next inner-most dimension 
this time. 

The algorithm can compute the dependence distance associated with each 
dimension of the descriptor. If a given dimension of both descriptors comes from 
the same original loop index and there is a non-empty intersection, then the 
algorithm can compute the accurate dependence distance for that loop. 

After the algorithm has checked each dimension and adjusted the base offsets 
at each step, it calls itself with a dimension of 0, which compares the adjusted 
base offsets of the two descriptors. If they are not equal, then the algorithm 
returns an empty intersection. If the base offsets are equal, then an intersection 
descriptor with that base offset is returned. As the algorithm returns from each 
recursive call, a dimension is added to the intersection descriptor. 

The intersection of a given dimension can occur at either side (to the right 
or left) of the right descriptor. If the intersection happens at both sides, then 
two intersection descriptors will be generated for that dimension. The algorithm 
thus returns a list of descriptors representing the intersection. 
Algorithm Intersect

Input: Two LMADs, with properly nested, sorted dimensions:
           LMAD_left  = A^{δ_1,δ_2,...,δ_d}_{σ_1,σ_2,...,σ_d} + τ,
           LMAD_right = A^{δ_1,δ_2,...,δ_d}_{σ'_1,σ'_2,...,σ'_d} + τ'
           [τ ≤ τ' and δ_i ≤ δ_{i+1}]
       The number of the dimension to work on: k [0 ≤ k ≤ d]
       Direction to test: < or >

Output: Intersection LMAD list and dependence distance

Algorithm:

intersect(LMAD_left, LMAD_right, k, DIR) returns LMAD_List, DIST
    LMAD_List ← ∅
    D ← τ' − τ
    if (k == 0) then   // scalar intersection
        if (D == 0) then
            LMAD_rlist1 ← LMAD_scalar(τ)
            add_to_list(LMAD_List, LMAD_rlist1)
        endif
        return LMAD_List, 0
    endif

    if (D ≤ σ_k) then   // periodic intersection on the left
        R ← mod(D, δ_k)
        m ← ⌊D / δ_k⌋
I1:     LMAD_rlist1 ← intersect(remove_dim(LMAD_left, k, τ + mδ_k),
                                remove_dim(LMAD_right, k, τ'), k − 1, <)
        if (LMAD_rlist1 ≠ ∅ and loop indices match) then
            DIST_k = (DIR is < ? m : −m)
        LMAD_rlist1 ← add_dim(LMAD_rlist1, dim(δ_k, min(σ_k − mδ_k, σ'_k)))
        add_to_list(LMAD_List, LMAD_rlist1)

        if ((k > 1) and (R + width_{k−1} > δ_k)) then   // periodic intersection right
I2:         LMAD_rlist2 ← intersect(remove_dim(LMAD_right, k, τ'),
                                    remove_dim(LMAD_left, k, τ + (m + 1)δ_k),
                                    k − 1, >)
            if (LMAD_rlist2 ≠ ∅ and loop indices match) then
                DIST_k = (DIR is < ? m + 1 : −(m + 1))
            LMAD_rlist2 ← add_dim(LMAD_rlist2, dim(δ_k, min(σ_k − (m + 1)δ_k, σ'_k)))
            add_to_list(LMAD_List, LMAD_rlist2)
        endif

    else   // intersection at the end
I3:     LMAD_rlist1 ← intersect(remove_dim(LMAD_left, k, τ + σ_k),
                                remove_dim(LMAD_right, k, τ'), k − 1, <)
        if (LMAD_rlist1 ≠ ∅ and loop indices match) then
            DIST_k = (DIR is < ? σ_k/δ_k : −σ_k/δ_k)
        add_to_list(LMAD_List, LMAD_rlist1)
    endif

    return LMAD_List, DIST
end intersect



Fig. 4. Algorithm Intersect for LMADs 
Algorithm Intersect Support Routines

dim(stride, span) returns LMAD_Dimension
    Dimension.stride ← stride ; Dimension.span ← span
    return Dimension
end dim

add_dim(LMAD_List, Dimension) returns LMAD_List
    For each LMAD in LMAD_List
        // Add “Dimension” to the list of dimensions of LMAD
    return LMAD_List
end add_dim

remove_dim(LMAD_arg, k, Offset) returns LMAD
    LMAD_new ← LMAD_arg
    // Remove dimension k from the list of dimensions of LMAD_new
    // Set the base offset of LMAD_new to “Offset”
    return LMAD_new
end remove_dim



Fig. 5. Algorithm Intersect support routines. 
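
The support routines are simple enough to sketch directly. The data layout below (immutable dataclasses, integer strides, spans and offsets, and a 1-based dimension number k) is an assumption made for illustration; the figure does not prescribe a concrete representation.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Dimension:
    stride: int
    span: int


@dataclass(frozen=True)
class LMAD:
    base: int          # base offset
    dims: tuple = ()   # Dimension objects, dimension 1 first


def dim(stride, span):
    """Build one dimension from a stride and a span."""
    return Dimension(stride, span)


def add_dim(lmad_list, dimension):
    """Append the dimension to every LMAD in the list."""
    return [replace(l, dims=l.dims + (dimension,)) for l in lmad_list]


def remove_dim(lmad, k, offset):
    """Copy the LMAD without dimension k (1-based) and with a new base offset."""
    return LMAD(base=offset, dims=lmad.dims[:k - 1] + lmad.dims[k:])


a = LMAD(5, (dim(1, 2), dim(200, 1800)))
print(remove_dim(a, 2, 205))
print(add_dim([LMAD(5)], dim(1, 2)))
```

Using immutable values means that each recursive call of the intersection works on its own copy, so stripping a dimension for an inner call cannot disturb the caller's descriptors.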



The complexity of the LMAD intersection algorithm itself is exponential in
the number of dimensions, $d$, since $2^d$ LMADs may be produced when two LMADs
are intersected.



The Access Region Test Memory classification analysis is used within the 
ART to classify the accesses, caused by a section of code, to a region of memory. 
A region of memory is represented by an LMAD. When an LMAD is expanded 
by a loop index (to simulate the effect of executing the loop), cross-iteration 
dependences are found by one of two means: 

1. discovering overlap within a single LMAD (caused by the expansion) 

2. a non-empty intersection of two expanded LMADs 

The first case is found by checking whether the LMAD is perfectly nested, 
as defined above. If it loses the perfectly nested attribute due to expansion by a 
loop index, then the expansion has caused an internal overlap. That means that 
the subscripting offset sequence represented by the LMAD contains at least one 
offset that appears more than once. 

The second case is found by using the LMAD intersection algorithm on each 
pair of LMADs for a given variable. A non-empty intersection indicates that 
the two LMADs represent at least one identical memory address. Since this is 
caused by expansion by a loop index, it represents a loop-carried cross-iteration 
dependence. 
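
The two cases can be illustrated with a brute-force reference check that enumerates offsets explicitly. This is not the ART, which works on the descriptors without enumeration; it is only a sketch for small examples with concrete integer strides and spans, where every span is a non-negative multiple of its positive stride. All names are hypothetical.

```python
from itertools import product


def offsets(base, dims):
    """List of offsets of an LMAD, one entry per combination of index values."""
    per_dim = [
        range(0, span + 1, stride) if span > 0 else range(1)
        for stride, span in dims
    ]
    return [base + sum(steps) for steps in product(*per_dim)]


def has_internal_overlap(base, dims):
    """Case 1: some offset is produced by more than one combination of indices."""
    seq = offsets(base, dims)
    return len(seq) != len(set(seq))


def common_offsets(lmad_a, lmad_b):
    """Case 2: offsets described by both LMADs (empty set if none)."""
    return set(offsets(*lmad_a)) & set(offsets(*lmad_b))


# Case 1: A(I+J) with J = 0..2 and I = 0..9 revisits offsets across iterations.
print(has_internal_overlap(0, [(1, 2), (1, 9)]))

# Case 2: a write to A(2*I) against reads of A(2*I+1) and A(2*I+2), I = 0..9.
write = (0, [(2, 18)])
print(common_offsets(write, (1, [(2, 18)])))
print(sorted(common_offsets(write, (2, [(2, 18)]))))
```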

A full description of the ART may be found in (TTl rT2|| . 
3 Comparing the Internal Mechanisms of Dependence
Tests

This section will briefly describe several well-known dependence analysis tech-
niques and the internal mechanisms used in them, in preparation for comparing
their mechanisms with the LMAD intersection algorithm in Section 4. In our
description of these tests, we will consider a loop nest of the form in Figure 1. The
basic problem that these tests attempt to solve (the Dependence Problem) for two
references to an array, $A(s_1(I), s_2(I), \ldots, s_m(I))$ and $A(s'_1(I), s'_2(I), \ldots, s'_m(I))$,
is a simultaneous solution of a System of Equalities subject to a System of
Constraints:

$$s_k(I) = s'_k(I) \mid 1 \le k \le m \qquad (7)$$

$$L_k \le I_k \le U_k \mid 1 \le k \le d \qquad (8)$$



We will describe the characteristics of these tests in terms of eight criteria:

1. the acceptable form of coefficients of loop indices (coef form in Table 1),

2. the acceptable form of loop bounds (loop bnd form),

3. to what extent the test uses the System of Constraints (use constr?),

4. under what conditions (if any) the test is exact (exact?),

5. whether the test can produce dependence distance vectors (dist vec?),

6. whether the test can solve all dimensions simultaneously (simult soln),

7. the complexity of the mechanism (complex), and

8. the test’s ability to do interprocedural dependence testing (inter proc?).

These characteristics are summarized for all tests in Table 1.

3.1 Basic Techniques 

We consider three dependence analysis techniques - GCD, Extreme Value, and 
Fourier Elimination (FE) - to be basic in that they form a core group which 
covers most of the other types of techniques that have been used to determine 
dependence. Descriptions of these can be found in [22].

3.2 Extended Techniques 

The basic dependence analysis techniques described above can be too naive 
or expensive in practice. To cope with the disadvantages, numerous advanced 
techniques extending the basic ones have been proposed, some of which will be 
described in this section. 

Omega Test The Omega Test CHI uses clever heuristics to speed up the FE 
algorithm in commonly-occurring cases. The complexity of the Omega Test is 
typically polynomial for situations occurring in real programs. Algorithms exist 
for handling array reshaping at call sites, although they may result in adding 
complicated constraints to the system. 
Generalized GCD-Based Tests The GCD test can be generalized by writing
the System of Equalities in matrix form: $EI = e$, in which $E$ is the matrix formed
by the coefficients of the loop indices, $I$ is a vector of loop indices, and $e$ is a
vector of the remaining terms, which do not involve loop indices. The system
is solved by transforming matrix $E$ to a column echelon form matrix $F$ by a
series of unimodular operations. The equivalent equation $Ft = e$ is then solved
by back-substitution. If there are no solutions to this equation, then there will
be no solutions to the original equation, and therefore no dependences between
the original array references. When there are solutions to this equation, then the
solutions can be generated by values of the $t$ vector. Sometimes the $t$ vector can
be used to compute dependence distances.
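
For orientation, the basic single-equation GCD test that this matrix formulation generalizes can be sketched in a few lines. The sketch covers only the classical case of one linear equation with integer coefficients and ignores loop bounds entirely; it is not the generalized algorithm described above, and the function name is hypothetical.

```python
from functools import reduce
from math import gcd


def gcd_test(coeffs, constant):
    """Return True if sum(c_i * x_i) == constant may have an integer solution.

    A linear Diophantine equation has an integer solution exactly when the
    greatest common divisor of its coefficients divides the constant term.
    """
    g = reduce(gcd, (abs(c) for c in coeffs), 0)
    if g == 0:
        # All coefficients are zero: solvable only if the constant is zero too.
        return constant == 0
    return constant % g == 0


# A(2*I) against A(2*I' + 1): 2*I - 2*I' = 1 has no integer solution.
print(gcd_test([2, -2], 1))

# A(2*I) against A(2*I' + 2): 2*I - 2*I' = 2 may have one.
print(gcd_test([2, -2], 2))
```

A True result only means that a dependence cannot be ruled out, because the loop bounds may still exclude every integer solution.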

A typical test belonging to this class of tests is the Power test [23], which uses
the Generalized GCD to obtain a $t$ vector. The $t$ vector is then manipulated
to determine whether any inconsistencies can be detected with it. If the results
are inconclusive, the test finally resorts to using an FE-based test to determine
dependence. Other tests of this type are the Single Variable per Constraint test,
the Acyclic test, and the Simple Loop Residue test [EJ. These tests can tolerate 
unknown variables by just adding them to the vector of unknowns. 

λ Test The Extreme Value test can be generalized to test all dimensions of 
subscript expressions simultaneously, as done in the λ-test. In the λ-test, an attempt 
is made to apply the loop bounds to the solution of the System of Equalities, 
to determine whether the hyper-plane representing the solution intersects the 
bounded convex set representing the loop bounds. Linear constraints are formed 
from Equation 0 and simplified by removing redundancies. The Extreme Value 
technique is applied to these simplified constraints to find simultaneous 
real-valued solutions by comparing with loop bounds. The test is relatively 
fast because it approximates integer-valued solutions for linear constraints with 
real-valued solutions. The test extends the original Extreme Value test to handle 
coupled subscripts, but has no capability to handle non-linear subscripts. 

I Test The I test is a combination of the Extreme Value test and the GCD 
test. The I test operates on a single equation from the System of Equalities at a 
time, putting it in the form $\sum_{i=1}^{d} e_i I_i = e_0$, with $e_i \in \mathbb{Z}$. 

It first creates an interval equation from that equation, giving the upper and 
lower bounds of the left-hand side. In the original form, the upper and lower 
bounds are both equal to the constant value on the right-hand side. It iterates, 
selecting a term from the left-hand side on each iteration, moving it into the 
upper and lower bounds on the right-hand side (using the Extreme Value test 
mechanism), then applying the GCD test on the remaining coefficients. This 
process continues until either it can be proven that the equation is infeasible, or 
there are no more terms which can be moved. 

Delta Test One main drawback of Fourier-based tests like the Omega and 
Power tests is their execution cost. To alleviate this problem, the Delta test 
was developed for certain classes of array access patterns that can be commonly 
found in scientific codes. In some sense, the Delta test is a faster but more 
restricted version of the λ-test because it simplifies the generalized Extreme 
Value test by using pattern matching to identify certain easily-solved forms of 
the dependence problem. Subscript expressions are classified into ZIV (zero- 
index variable), SIV (single-index variable) and MIV (multiple-index variable) 
forms. ZIV and SIV forms are sub-categorized into various forms which are easily 
and exactly solved. The test propagates constraints discovered from solving SIV 
forms into the solution of the MIV forms. This propagation sometimes reduces 
the MIV forms into SIV problems, so the technique must iterate. 
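
The classification step can be illustrated with a short Python sketch. It is an assumption-laden simplification: subscripts are given as dictionaries from loop-index names to integer coefficients, the names `classify` and `strong_siv_distance` are hypothetical, and only one easily solved SIV sub-case is shown (both references use the same coefficient, often called the strong SIV case in the literature).

```python
def classify(src, dst):
    """Classify a pair of subscripts by how many distinct loop indices occur.
    src, dst: dicts mapping loop-index name -> integer coefficient."""
    used = {v for coeffs in (src, dst) for v, c in coeffs.items() if c != 0}
    if not used:
        return "ZIV"
    return "SIV" if len(used) == 1 else "MIV"


def strong_siv_distance(a, c1, c2, lower, upper):
    """Subscripts a*i + c1 and a*i' + c2 with lower <= i, i' <= upper, a != 0.
    Returns the dependence distance i' - i, or None if no dependence exists."""
    diff = c1 - c2
    if diff % a != 0:
        return None              # distance would not be an integer
    d = diff // a
    return d if abs(d) <= upper - lower else None  # must fit in the loop range
```

For example, `A(i+2)` and `A(i)` in a loop from 1 to 10 give a distance of 2, while `A(2*i+1)` and `A(2*i)` can never refer to the same element.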



Range Test The Range test is a generalized application of the Extreme 
Value test. It can be used with symbolic coefficients, and even with non-affine 
functions for $f$ and $g$ from Figure 0. It starts with an equation of the form 
$\sum_{i=1}^{d} b_i I_i + b_0 = \sum_{i=1}^{d} c_i I_i + c_0$, with $b_i, c_i \in \mathbb{Z}$, and attempts to determine whether 
the minimum value of the left-hand side ($f^{min}$) is larger than the maximum value 
of the right-hand side ($g^{max}$), or whether the maximum value of the left-hand 
side ($f^{max}$) is smaller than the minimum value of the right-hand side ($g^{min}$). It 
uses the fact that if $f^{max}(I_1, I_2, \ldots, I_d) < g^{min}(I_1, I_2, \ldots, I_d + 1)$ for all values 
of all indices, and for $g^{min}$ monotonically non-decreasing, then there can be no 
loop-carried dependence from $A(f(I))$ to $A(g(I))$. 

First, the Range test must prove that the functions involved are monotonic 
with respect to the loop indices. It does this by symbolically incrementing the 
loop variable and subtracting the original form from the incremented form of the 
expression. If the result can be proven to be always positive or always negative, 
then the expression is monotonic with respect to that loop variable. Once the 
subscript expressions are proven monotonic, the Range test determines whether, 
for a given nesting of the loops surrounding the array references being tested, 
the addresses of one reference are always larger or smaller than for the other 
reference. This is similar to the Extreme Value test, although the whole test is 
structured to make use of symbolic range information. 
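
The disjointness condition used by the Range test can be illustrated by brute-force enumeration. This sketch is only a numeric stand-in: the real test reasons symbolically, whereas the hypothetical function `outer_loop_carries_no_dependence` below evaluates the subscript functions for concrete loop ranges. It assumes a two-level nest whose ranges are passed as re-iterable `range` objects.

```python
def outer_loop_carries_no_dependence(f, g, outer, inner):
    """Check f_max(i) < g_min(i + 1) for consecutive outer iterations,
    together with g_min being monotonically non-decreasing.
    f, g: subscript functions of (outer index, inner index)."""
    f_max = {i: max(f(i, j) for j in inner) for i in outer}
    g_min = {i: min(g(i, j) for j in inner) for i in outer}
    idx = list(outer)
    for a, b in zip(idx, idx[1:]):
        if g_min[b] < g_min[a]:
            return False  # g_min is not monotonically non-decreasing
        if not f_max[a] < g_min[b]:
            return False  # ranges of consecutive iterations may overlap
    return True
```

With `f = g = lambda i, j: n*i + j` and `n = 10`, an inner range of `range(0, n)` touches disjoint blocks of addresses per outer iteration, while `range(0, n + 1)` makes consecutive blocks overlap by one element.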



4 Comparing the ART with Other Dependence Techniques 

Comparing the ratings of the LMAD intersection algorithm with the ratings of 
the other dependence testing mechanisms, it is apparent that the ART mechanism 
is most similar to the most powerful dependence testing techniques, those 
based on FE. Interprocedural accuracy may even be greater for the ART than 
for the FE-based techniques because LMADs can be translated precisely across 
procedure boundaries without adding complexity to the representation. The 
FE-based techniques have a more general formulation, and therefore can avoid 
restrictions, such as those placed on the inputs of the LMAD intersection algorithm. 
Test       coef   loop bnd  use       exact?        dist  simult  complexity          inter
           form   form      constr?                 vec?  soln                        proc?

GCD        ints   N.A.      no        no            no    no      lin                 no
Ext Val    ints   ints      loop bnd  yes: coef ±1  no    no      lin                 no
FE         ints   exprs     yes       yes           yes   yes     exp                 yes
Omega      ints   exprs     yes       yes           yes   yes     exp (usually poly)  yes
Gen GCD    ints   exprs     yes       yes           yes   yes     exp                 yes
λ          ints   ints      loop bnd  yes: 2D       yes   yes     poly                no
I          ints   ints      loop bnd  yes: coef ±1  no    no      lin                 no
Δ (Delta)  ints   exprs     yes       usually       yes   yes     lin/poly/exp        no
Range      exprs  exprs     yes       no            no    no      poly                no
ART        exprs  exprs     yes       usually       yes   yes     exp                 yes


Table 1. The characteristics of dependence testing techniques compared. 



4.1 Comparison with GCD and Extreme Value 

The LMAD intersection algorithm essentially combines the type of testing done 
within the two most basic dependence testing algorithms, the GCD test and the 
Extreme Value Test, although with higher accuracy. The intersection algorithm 
actually finds values of the loop indices for which a dependence occurs, so it 
does more than just determine that a dependence exists. It is able to construct 
the distance vector for the dependence, which is more than either the GCD or 
Extreme Value Test can do. 

The LMAD intersection algorithm will report no intersection whenever the 
Extreme Value test would report no dependence. If the minimum ($\tau'$) of one 
descriptor is greater than the maximum ($\tau$ + widths) of the other, then intersect 
would employ step 13 for each dimension until for the scalar case the distance D 
would be greater than 0, meaning no intersection. In all other cases, the Extreme 
Value test would report a dependence, while the LMAD intersection algorithm 
would sometimes report no intersection. 

In steps 11, 12, and 13, the base offsets of the LMADs are maintained at the 
original base offset plus a linear combination of the strides involved: 
$\tau' + \sum_i t_i \delta_i$ and $\tau + \sum_j t_j \delta_j$, where $i \ne j$. The $t_i$ are formed from the 
values for $m$ and $m + 1$, according to the rules for calculating DIST in the 
algorithm. Therefore, when the final comparison (for $k = 0$) is made, it is 
equivalent to testing whether 

$$\tau' + \sum_i t_i \delta_i = \tau + \sum_j t_j \delta_j \quad \text{or} \quad \tau' - \tau = \sum_j t_j \delta_j - \sum_i t_i \delta_i .$$

This means that the difference between the base offsets is made up of a linear 
combination of the coefficients of the loop indices, which is precisely what the 
GCD test attempts to test for, albeit in a crude way. The GCD test merely 
tells us whether it is possible that the difference between the base offsets can 
be expressed as a linear combination of the coefficients, while our intersection 
algorithm constructs precisely what the linear combination is. 
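
For contrast, the yes/no answer given by the basic GCD test fits in a few lines of Python. This is a minimal sketch with a hypothetical function name; it ignores loop bounds, so a `True` result only means that a dependence cannot be ruled out.

```python
from functools import reduce
from math import gcd


def gcd_test(coefficients, offset_difference):
    """Return True if offset_difference may be an integer linear combination
    of the coefficients, i.e. if their GCD divides it."""
    g = reduce(gcd, (abs(c) for c in coefficients), 0)
    if g == 0:  # no (non-zero) coefficients at all
        return offset_difference == 0
    return offset_difference % g == 0
```

`gcd_test([2, 4], 3)` rules out a dependence, whereas `gcd_test([2, 4], 6)` does not, and the test gives no hint of which combination (here 2·1 + 4·1) produces the difference.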





302 Jay Hoeflinger and Yunheung Paek 



4.2 Values Unknown at Compile Time 

Both the GCD test and the Extreme Value test require that the coefficients and 
loop bounds be integer constants because of the operations they carry out. There 
is no way to compute the GCD of symbolic expressions, and there is no way to 
compute the “positive part of a number” or the “negative part of a number” 
symbolically, as the Extreme Value test would require. 

The LMAD intersect algorithm, on the other hand, has fewer symbolic 
limitations. The operations used in it are all amenable to symbolic computation, 
except for the mod function. The mod function is only used in a check to 
determine whether there could be an intersection on the right, at step 12. If D is 
not a symbolic multiple of $\delta_k$ (in which case we could use the value 0 for R), 
then we can just always call intersect at 12. The rest of the functions can 
be represented within symbolic expressions. The floor function is simple integer 
division. The min function can be carried in a symbolic expression, and in some 
cases simplified algebraically. 

5 Conclusion 

We have described the LMAD, the ART and the LMAD intersection algorithm. 
We discussed the properties of the intersection algorithm in some detail and 
compared it with the mechanisms of a set of well-known dependence tests. 

The comparison made in Table 1 demonstrates that the ART has characteristics 
similar to the most powerful dependence tests, those based on Fourier 
Elimination, yet the intersection mechanism is somewhat simpler to describe. 
The ART may have an advantage in interprocedural analysis, while the FE-based 
techniques are more general in formulation. 
Abstract. The Abstract Parallel Machine (APM) model separates the 
definitions of parallel operations from the application algorithm, which 
defines the sequence of parallel operations to be executed. An APM contains 
a set of parallel operation definitions, which specify how the computation 
is organized into independent sites of computation and what 
data exchanges are required. This paper adds explicit cost models as the 
third component of an APM system. The costs of parallel operations 
can be obtained either by analyzing a parallel operation definition, or by 
measuring performance on a real machine. Costs with monotonicity constraints 
allow the cost of an algorithm to be transformed automatically 
as the algorithm itself is transformed. 



1 Introduction 

There is increasing recognition of the fundamental role that cost models play 
in the design of parallel programs. They enable the time and space complexity 
of a program to be determined, as for sequential algorithms, but parallel 
cost models serve several additional purposes. For example, intuition is often an 
inadequate basis for making the right choices about organizing the parallelism 
and using the system’s resources. Suitable high level cost models allow the 
programmer to assess each alternative quantitatively during the design process, 
improving efficiency without requiring an inordinate amount of programming 
time. Portability of efficiency is one of the chief problems in parallel 
programming, and cost models can help here by indicating where an algorithm 
should be modified to make effective use of a particular machine’s capabilities. 
Such motivations have led to a plethora of approaches to cost modeling. 

APMs (abstract parallel machines) are an approach for describing 
parallel programming models, especially in the context of program transformation. 
In this approach the parallel behavior is encapsulated in a set of ParOps 
(parallel operations), which are analogous to combinators in data parallel 
programming and skeletons in BMF. An explicit separation is made between 
the definitions of the ParOps and the specific parallel algorithm to be 

S.P. Midkiff et al. (Eds.): LCPC 2000, LNCS 2017, pp. 16-03 2001. 

(c) Springer-Verlag Berlin Heidelberg 2001 
implemented. An APM consists of a set of ParOps and a coordination language; 
algorithms are built up from the ParOps of one APM and are expressed using a 
coordination language for that APM. APMs are not meant as programming 
languages; rather, they illustrate programming models and their relationships, and 
provide a basis for algorithm transformation. The relationships between different 
parallel operations can be clarified with a hierarchy of related APMs. 

There is some notion of costs already inherent in the definition of an APM, 
since the parallel operation definitions state how the operation is organized into 
parallel sites and what communications are required. The cost of an algorithm is 
determined by the costs of the ParOps it uses, and the cost of a ParOp could be 
derived from its internal description. This would allow a derivation to be based 
only on the information inside the APM definition. However, this is not the only 
way to obtain costs, and a more general and explicit treatment of costs can be 
useful. 

In this paper, we enrich the APM approach by adding an explicit hierarchy 
of cost models. Every APM can be associated with one or more cost models, 
reflecting the possibility that the APM could be realized on different machines. 
The cost models are related to each other, following the connections in the 
APM hierarchy. Each cost model gives the cost of every ParOp within an APM. 
There are several reasonable ways to assign a cost to a parallel operation: it 
could be inferred from the internal structure (using the organization into sites, 
communications and data dependencies); it could be obtained by transforming 
mathematically the cost of the corresponding operation in a related APM; it 
could be determined by measuring the real cost for a specific implementation. 

The goal is to support transformations of an algorithm from one APM 
to another that automatically yield the new costs. Such a cost transformation 
could be used in several ways. The costs could guide the transformation of an 
algorithm through the APM hierarchy, from an abstract specification to a concrete 
realization. If the costs for an APM were obtained by measuring performance 
on a real machine, then realistic cost transformations are possible although the 
transformation takes place at an abstract level. In some algorithm derivations, 
it is helpful to begin with a horizontal transformation within the same APM 
that increases the costs. This can happen because the reorganized algorithm 
may satisfy the constraints required to allow a vertical transformation to a more 
efficient algorithm using a different APM. In such complex program derivations 
it is helpful to be explicit about the costs and programming models in use at 
each stage; that is the purpose of the APM methodology. 

The rest of the paper is organized as follows: Section 2 gives an overview of 
the APM approach. Section 3 introduces cost hierarchies to APM hierarchies. 
Sections 4 and 5 illustrate the approach by examples. Section 6 concludes. 

2 Overview of APMs 

Abstract Parallel Machines (APMs) have been proposed as a formal 
framework for the derivation of parallel algorithms using a sequence of 
transformation steps. The formulation of a parallel algorithm depends not only on the 
algorithm-specific potential parallelism but also on the parallel programming 
model and the target machine to be used. Every programming model provides 
a specific way to exploit or express parallelism, such as data parallelism or 
thread parallelism, in which the algorithm has to be described. An APM 
describes the behavior of a parallel programming model by providing operations 
(or patterns of operations) to be used in a program written in that programming 
model. The basic operations provided by an APM are parallel operations 
(ParOps), which are combined by an APM-specific coordination language 
(usually including, e.g., a composition function). The application is formulated for 
an APM with ParOps as the smallest indivisible parallel units to express a 
specific application algorithm. Depending on the level of abstraction, an executable 
program (e.g., an MPI program for distributed memory machines) or a more 
abstract specification (e.g., a PRAM program for a theoretical analysis) results. 
The APM approach comprises: 

— the specification framework of ParOps defining the smallest units in a specific 
parallel programming model, see Section 2.1; 

— APM definitions consisting of a set of ParOps and a coordination language 
using them, see Section 0 for an example; 

— a hierarchy of APMs built up from different APMs (expressing different 
parallel programming models) and relations of expressiveness between them; 

— the formulation of an algorithm within one specific APM, see also Section 2] 
for an example; and 

— the transformation of an algorithm into an equivalent algorithm (e.g., an 
algorithm with the same result semantics), but expressed in a different way 
within the same APM (horizontal transformation) or in a different APM 
(vertical transformation), see Section 2.3. 

In the following subsections, we describe the APM framework in more detail. 



2.1 Framework to Define a Parallel Operation ParOp 

A parallel operation ParOp is executed on a number of sites $P_1, \ldots, P_n$ (these 
may be virtual processors or real processors). The framework for describing a 
ParOp uses a local function $f_i$ executed on site $P_i$ using the local state $s_i$ of $P_i$ for 
$i = 1, \ldots, n$ and data provided by other sites $z_1, \ldots, z_n$ or input data $x_1, \ldots, x_r$. 
Data from other sites used by $P_i$ are provided by a projection function $g_i$ which 
selects data from the set $V$ of available values, consisting of the inputs $x_1, \ldots, x_r$ 
and the data $z_1, \ldots, z_n$ of all other sites, see Figure 1. The result of a ParOp is a 
new state $s'_1, \ldots, s'_n$ and output data $y_1, \ldots, y_t$. A ParOp is formally defined by 

$$
\begin{aligned}
\textbf{ParOp}\ \mathit{ARG}\ (s_1, \ldots, s_n)\ (x_1, \ldots, x_r) &= ((s'_1, \ldots, s'_n),\ (y_1, \ldots, y_t)) \\
\text{where}\quad (s'_i, z_i) &= f_i(s_i, g_i(V)) \\
(y_1, \ldots, y_t) &= g_0(V) \\
V &= ((x_1, \ldots, x_r), z_1, \ldots, z_n)
\end{aligned}
\tag{1}
$$
Fig. 1. Illustration of the possible data flow of a ParOp. $f_i$ denotes the local 
computation of site $P_i$. $g_i$ chooses values from $x, A_1, \ldots, A_n$. Only one projection function $g_i$ is 
depicted so as to keep the illustration readable. The arrows do not indicate a cycle of data 
dependencies since the value provided by $f_i$ need not be given back by $g_i$. 



where ARG is a list of functions from $\{f_1, \ldots, f_n, g_0, g_1, \ldots, g_n\}$ and contains 
exactly those functions that are not fully defined within the body of the ParOp. 
The functions $f_1, \ldots, f_n, g_0, g_1, \ldots, g_n$ in the body of the ParOp definition can 

— be defined as closed functions, so that the behavior of the ParOp is fully 
defined, 

— define a class of functions, so that details have to be provided when using 
the ParOp in a program, or 

— be left undefined, so that the entire function has to be provided when using 
the ParOp. 

The functions that have to be provided when using the ParOp appear in the 
function argument list ARG as formal parameters in the definition and as 
actual functions in a call of a ParOp. 

The framework describes what a ParOp does, but not necessarily how it is 
implemented. In particular, the gi functions imply data dependencies among the 
sites; these dependencies constrain the set of possible execution orders, but they 
may not fully define the order in an implementation. Consequently, the cost 
model for a ParOp may make additional assumptions about the execution order 
(for example, the degree of parallelism). 

2.2 Using APMs in a Program 

To express an application algorithm, the parallel operations defined for a specific 
APM are used and combined according to the coordination language. When 
using a ParOp in a program, one does not insert an entire ParOp definition. 
Instead, the operation is called along with any specific function arguments that 
are required. Whether function arguments are required depends on the definition 
of the specific ParOp. 
If the function arguments $f_i$, $g_i$ are fully defined as functions in closed form, 
then no additional information is needed and the ParOp is called by just using 
its name. If one or more of the functions $f_i$, $g_i$, $i = 1, \ldots, n$, are left undefined, then 
the call of this ParOp has to include the specific functions, possibly restricted 
according to the class of functions allowed. A call of a ParOp has the form 

ParOp ARG 

where ARG contains exactly those functions of $(f_1, \ldots, f_n), (g_0, \ldots, g_n)$ that are 
needed. This might be given in the form 

$f_{k_1}$ = definition, ..., $f_{k_i}$ = definition, 
$g_{l_1}$ = definition, ..., $g_{l_j}$ = definition, 
ParOp($f_{k_1}, \ldots, f_{k_i}, g_{l_1}, \ldots, g_{l_j}$) 



2.3 Vertical and Horizontal Transformations between APMs 

One goal of the APM approach is to model different parallel programming models 
within the same framework so that the relationship between two different models 
can be expressed. The relationship between two parallel programming models is 
based on the expressiveness of the APMs which is captured in the ParOps and 
the coordination language combining the ParOps. 

We define a relation between two different APMs in terms of a transformation 
mapping any program for an APM $M_1$ onto a program for an APM $M_2$. 
The transformation is built up according to the structure of an APM program; 
thus it is defined on the ParOps of APM $M_1$ and then generalized to the entire 
coordination language. The transformation of a ParOp is based on its result 
semantics, i.e., the local data and output produced from input data at a given 
local state. 

An APM $M_1$ can be simulated by another APM $M_2$ if for every ParOp $F$ 
of $M_1$ there is a ParOp $G$ (or a sequence of ParOps $G_1, \ldots, G_l$) which has the 
same result semantics as $F$, i.e., starting with the same local states $s_1, \ldots, s_n$ 
and input data $x$ it produces the same new local states $s'_1, \ldots, s'_n$ and output 
data $y$. 

If an APM $M_1$ can be simulated by an APM $M_2$, this does not necessarily 
mean that $M_2$ can be simulated by $M_1$. If $M_1$ can be simulated by $M_2$, then 
$M_1$ is usually more abstract than $M_2$. Therefore, we arrange $M_1$ and $M_2$ in a 
hierarchical relationship with $M_1$ being the parent node of $M_2$. Considering an 
entire set of APMs, we get a tree or a forest showing a hierarchy of APMs and 
the relationship between them, see Figure 2. 

The relationship between APMs serves as a basis for transforming an 
algorithm expressed on one APM to the same algorithm now expressed in the 
second related APM. For two related APMs $M_1$ and $M_2$ a transformation 
operation $T_{M_1}^{M_2}$ from $M_1$ to $M_2$ is defined according to the simulation relation, i.e., 
for each ParOp $F$ of APM $M_1$ 

$$T_{M_1}^{M_2}(F) = G \quad (\text{or } T_{M_1}^{M_2}(F) = G_1, \ldots, G_l)$$
[Figure 2, image not reproduced. Panels: "Hierarchy of abstract machines", 
"Derivation of an algorithm", "Explanation of the notation". Labels: refinement 
of abstract parallel machines; derivation sequence of an algorithm; realization 
of an algorithm according to an APM.] 


Fig. 2. Illustration of a hierarchy of abstract parallel machines and a derivation of an 
algorithm according to the hierarchy. 



where $G$ is the ParOp of $M_2$ to which $F$ is related. In this kind of transformation 
step (which is called a vertical transformation), the program $A_1$ is left essentially 
unchanged, but it is realized on a different APM: thus $(A_1, M_1)$ is transformed to 
$(A_2, M_2)$. The operations in $A_1$ are replaced according to the transformation $T_{M_1}^{M_2}$, 
so that $A_2$ uses the parallel operations in $M_2$ which realize the operations used 
by $A_1$. 

There can also be a second kind of transformation (called a horizontal 
transformation) that takes place entirely within one APM $M_1$: $(A_1, M_1)$ is 
transformed into $(A_2, M_1)$, where a correctness-preserving transformation must be 
used to convert $A_1$ into $A_2$. In the context of our methodology, this means that 
a proof is required that for all possible inputs $X^0, \ldots, X^n$ and states $\sigma$, the two 
versions of the algorithm produce the same result, i.e. 

$$A_1(X^0, \ldots, X^n, \sigma) = A_2(X^0, \ldots, X^n, \sigma).$$

There are several approaches for developing parallel programs by performing 
transformation steps, many of which have been pursued in a functional 
programming environment. Transformations based on the notion of homomorphism 
and the Bird-Meertens formalism have been used. P3L uses a set of algorithmic 
skeletons like pipelines and worker farms to capture common parallel 
programming paradigms. A parallel functional skeleton technique that emphasizes the 
data organization and redistributions has been described. A sophisticated 
approach for the cost modeling of the composition of skeletons for a homomorphic 
skeleton system equipped with an equational transformation system has also been 
outlined. The costs of the skeletons are required to be monotonic in the costs of 
the argument functions. The method performs a stepwise application of 
rewriting rules such that each application of a rewriting rule is cost-reducing. All these 
approaches restrict the algorithm to a single programming model, and they use 
the costs only to help select horizontal transformations. Vertical transformations 
between different programming models which could be used for the 
concretization of parallel programs are not supported. 
Fig. 3. Illustration of the connection between APMs, their corresponding costs and 
algorithms expressed using those APMs. The hierarchy contains only $APM_1$ and $APM_2$ with 
associated cost models $costs_1$ and $costs_2$. An algorithm A can be expressed within $APM_1$ 
and is transformed horizontally into algorithm B within the same APM which is then 
transformed vertically into algorithm C within $APM_2$. 



3 Cost Hierarchies 

The APM method has separated the specifics of a parallel 
programming model from the properties of an algorithm to be expressed in the 
model. Since the APMs and the algorithm are expressed with a similar 
formalism, and the relations between APMs are specified precisely, it is possible to 
perform program transformations between different parallel programming 
models. In this section, we enrich the APM approach with a third component to 
capture costs, see Figure 3. The goal is to provide information that supports 
cost-driven transformations. 



3.1 Cost Models for Leaf Machines 

We consider an APM hierarchy whose leaves describe real machines with a 
programming interface for non-local operations. These would include, for example, 
a communication library for distributed memory machines (DMMs) or a 
coordination library for accessing the global memory of shared memory machines 
(SMMs) concurrently. Each operation of the real parallel machine is modeled by 
an operation of the corresponding leaf APM. By measuring the runtimes of the 
operations on the real parallel machine, the APM operations can be assigned 
costs that can be used to describe costs of a program. Since the execution times 
of many operations of the real machine depend on a number of parameters, the 
costs of the APM operations are described by parameterized runtime functions. 
For example, the costs of a broadcast operation on a DMM depend on the 
number $p$ of participating processors and the size $n$ of the message to be broadcast. 
Correspondingly, the costs are described by a function 

$$t_{broad}(p, n) = f(p, n)$$

where $f$ depends on the specific parallel machine and the communication library. 
Usually, it is difficult to describe exactly the execution time of operations 
on real parallel machines. The local execution times are difficult to describe, 
since the processors may have a complicated internal architecture including a 
memory hierarchy, several functional units, and pipelines with different stages. 
Moreover, techniques like branch prediction and out-of-order execution may be 
used. The global execution times may be difficult to describe, since, for example, 
a distributed shared memory machine uses a physically distributed memory and 
emulates a shared memory by several caches, using, e.g., a directory-based cache 
coherence protocol. But for many (regular) applications and many DMMs, it is 
possible to describe the execution times of the machines accurately enough to 
compare different execution schemes for the same algorithm. The main 
concern of this article is not so much to give an accurate model for a specific 
(class of) parallel machines, but rather to extend an existing model so that it 
can be used at a higher programming level to compare different implementations 
of the same algorithm or to guide program transformations that lead to a more 
efficient program. 

3.2 Bottom-Up Construction of a Cost Hierarchy 

Based on the cost models for the leaves of an APM hierarchy, cost models for 
the inner nodes of an APM hierarchy can be derived step by step. We consider 
an APM $M_1$ which is the parent of an APM $M_2$ for which a cost measure $C_2$ 
has already been defined. At the beginning of the derivation $M_2$ has to be a 
leaf. Since $M_1$ is the parent of $M_2$, there is a transformation $T_{M_1}^{M_2}$ which assigns 
each parallel operation $F$ of $M_1$ an equivalent sequence $G_1, \ldots, G_l$ of parallel 
operations, each of which has been assigned a cost $C_2(G_i)$. We define a cost measure 
$C_{M_1 \leftarrow M_2}$ based on the cost measure $C_2$ for $M_2$ by 

$$C_{M_1 \leftarrow M_2}(F) = \sum_{i=1}^{l} C_2(G_i). \tag{2}$$
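
As an illustration of definition (2), the following Python sketch derives the cost of a parent operation from parameterized leaf costs. Everything concrete in it is an assumption made for the example: the operation names, the mapping `TRANSFORM`, the constants `ALPHA` and `BETA`, and the logarithmic tree-shaped cost formula are hypothetical and not taken from a measured machine.

```python
import math

ALPHA = 10.0  # hypothetical per-message startup time
BETA = 0.5    # hypothetical per-element transfer time


def t_broadcast(p, n):
    """Assumed leaf cost: tree broadcast of n elements among p processors."""
    return math.ceil(math.log2(p)) * (ALPHA + BETA * n)


def t_reduce(p, n):
    """Assumed leaf cost: tree reduction of n elements among p processors."""
    return math.ceil(math.log2(p)) * (ALPHA + BETA * n)


# Cost measure C2 of the leaf APM M2: operation name -> runtime function.
C2 = {"broadcast": t_broadcast, "reduce": t_reduce}

# Vertical transformation from M1 to M2: each parent ParOp F is mapped
# to its equivalent sequence G_1, ..., G_l of leaf ParOps.
TRANSFORM = {"allreduce": ["reduce", "broadcast"]}


def derived_cost(op, p, n):
    """Definition (2): sum the leaf costs of the operations realizing op."""
    return sum(C2[g](p, n) for g in TRANSFORM[op])
```

Replacing `C2` by the runtime functions measured on a different target machine changes the derived parent costs without touching the transformation itself.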

A cost measure C2 for M2 may again be based on other cost measures, if M2 
is not a leaf. If the programmer intends to derive a parallel program for a real 
parallel machine R which is a leaf in the APM hierarchy, each intermediate level 
APM is assigned a cost measure that is based on the cost of R, i.e., the selection 
of the cost measure is determined by the target machine. Thus, for each path 
from a leaf to an inner node B there is a possibly different cost measure. 

We now can define the equivalence of cost measures for an inner node $M$ 
of an APM hierarchy with children $M_1$ and $M_2$. Cost measures $C_{M \leftarrow M_1}$ and 
$C_{M \leftarrow M_2}$ for APM $M$ can be defined based on cost measures $C_1$ and $C_2$ of 
$M_1$ and $M_2$, respectively. We call $C_{M \leftarrow M_1}$ and $C_{M \leftarrow M_2}$ equivalent if for arbitrary 
programs $A_1$ and $A_2$, the following is true: 

If $C_{M \leftarrow M_2}(A_1) < C_{M \leftarrow M_2}(A_2)$ then $C_{M \leftarrow M_1}(A_1) < C_{M \leftarrow M_1}(A_2)$ 

and vice versa. If two cost measures are equivalent, then both measures can be 
used to derive efficient programs and it is guaranteed that both result in the 
same program since both have the same notion of optimality. Note that the 
equivalence of two cost measures for an APM does not require that they yield 
the same cost value for each program. 

3.3 Monotonicity of Cost Measures 

The cost measure for an APM can be used to guide horizontal transformations.
For this purpose, a cost measure must fulfil the property that a horizontal
transformation that is useful for an APM $M$ is also useful for all concretizations of
$M$. This property is described more precisely by the notion of monotonicity of
cost measures. We consider APMs $M_2$ and $M_1$ where $M_2$ is a child of $M_1$ in
the APM hierarchy. Let $A_1$ and $A_2$ be two programs for APM $M_1$ where $A_2$ is
obtained from $A_1$ by a horizontal transformation $T_{M_1}$ which reduces the costs
according to a cost measure $C_1$, i.e.,

$C_1(A_1) > C_1(A_2) = C_1(T_{M_1}(A_1))$.

Let $T_{M_1}^{M_2}$ be the vertical transformation from $M_1$ to $M_2$, i.e., the corresponding
programs to $A_1$ and $A_2$ on APM $M_2$ are $A'_1 = T_{M_1}^{M_2}(A_1)$ and $A'_2 = T_{M_1}^{M_2}(A_2)$,
respectively. Both these programs can be assigned costs according to a cost
measure $C_2$ of APM $M_2$. The cost measures $C_1$ and $C_2$ are consistent only if the
increase in efficiency that has been obtained by the transformation from $A_1$ to
$A_2$ carries over to APM $M_2$, i.e., only if

$C_2(T_{M_1}^{M_2}(A_1)) > C_2(T_{M_1}^{M_2}(A_2))$.

This property is captured by the following definition of monotonicity. The
transformation $T_{M_1}^{M_2}$ is monotonic with respect to the costs $C_1$ and $C_2$, if for arbitrary
programs $A_1$ and $A_2$

$C_1(A_1) > C_1(A_2)$ implies $C_2(T_{M_1}^{M_2}(A_1)) > C_2(T_{M_1}^{M_2}(A_2))$.
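
As a small illustration of this definition, the Python sketch below tests monotonicity on a finite set of programs. Everything concrete in it is an assumption made for the example: a program is a hypothetical pair (ordinary steps, scan operations), the abstract machine charges 1 for either kind of operation, and the vertical transformation `lower` expands each scan into a fixed number of ordinary steps on the child machine.

```python
def is_monotonic(vertical, cost1, cost2, programs):
    """Check: cost1(a1) > cost1(a2) implies cost2(vertical(a1)) > cost2(vertical(a2))."""
    return all(
        cost2(vertical(a1)) > cost2(vertical(a2))
        for a1 in programs
        for a2 in programs
        if cost1(a1) > cost1(a2)
    )

# Hypothetical programs: (number of ordinary steps, number of scan operations).
programs = [(5, 0), (1, 3), (2, 1)]

# Abstract machine: every step and every scan costs 1.
cost1 = lambda p: p[0] + p[1]

def lower(scan_steps):
    """Vertical transformation: a scan becomes `scan_steps` ordinary steps."""
    return lambda p: p[0] + scan_steps * p[1]

# Child machine: two cycles per ordinary step.
cost2 = lambda steps: 2 * steps

cheap_scan = is_monotonic(lower(1), cost1, cost2, programs)
costly_scan = is_monotonic(lower(4), cost1, cost2, programs)
```

When a scan expands to one step the ordering of programs is preserved. When it expands to four steps, the program (1, 3) looks cheaper than (5, 0) on the abstract machine but is more expensive on the child machine, so the pair of cost measures is not monotonic.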

The bottom-up construction of cost measures according to an APM hierarchy
creates monotonic cost measures; this can be proven by a bottom-up induction
over the APM tree using definition (2) of cost measures. In the next section, we
describe the PRAM model with different operation sets in the APM methodology.
For this example, we do not use the bottom-up cost definition but use the
standard costs of the PRAM model.

3.4 Other Cost Models 

One of the most popular parallel cost models is the PRAM model jSl and its 
extensions. Because none of the PRAM models was completely satisfactory, a 
number of other models have been proposed that are not based on the existence 
of a global memory, including BSP and logP [b] . Both provide a cost calcu- 
lus by modeling the target architecture with several parameters that capture its 
computation and communication performance. The supersteps in the BSP model
allow for a straightforward estimation of the runtime of complete programs El.
Another cost modeling method is available for the skeleton approach used in the 
context of functional programming dCl In the skeleton approach, programs 
are composed of predefined building blocks {data skeletons) with a predefined 
computation and communication behavior which can be combined by algorith- 
mic skeletons capturing common design principles like divide and conquer. This 
structured form of defining programs enables a cost modeling according to the 
compositional structure of the programs. A cost modeling with monads has been 
used in the GOLDFISH system El- In contrast to those approaches, APM costs 
allow the transformation of costs from different parallel programming models, 
so they can be used to guide transformations between multiple programming 
models. 

4 APM Description of PRAMs 

The PRAM (parallel random access machine) is a popular parallel programming 
model in theoretical computer science, widely used to design and analyze par- 
allel algorithms for an idealized shared memory machine. A PRAM consists of 
a bounded set of processors and a common memory containing a potentially 
unlimited number of words E|. Each processor is similar to a RAM that can 
access its local random access memory and the common memory. A PRAM algo- 
rithm consists of a number of PRAM steps which are performed in a SIMD-like 
fashion; i.e., all processors needed in the algorithm take part in a number of 
consecutive synchronized computation steps in which the same local function is 
performed. One PRAM step consists of three parts: 

1. the processors read from the common memory; 

2. the processors perform local computation with data from their local memo- 
ries; and 

3. the processors write results to the common memory. 

Local computations differ because of the local data used and the unique
identification number $id_i$ of each processor $P_i$, for $i = 1, \dots, n$ (where $n$ is the number
of processors).

4.1 An APM for PRAMs 

There are many ways to describe the PRAM within the APM framework. In 
the PRAM model itself, the processors perform operations that cause values 
to be obtained from and sent to the common memory, but the behavior of the 
common memory itself is not modeled in detail. It is natural, therefore, to treat 
each PRAM processor as an APM site, and to treat the common memory as an 
implicit agent which is not described explicitly. The APM framework allows this 
abstract picture: transactions between the processors and the common memory 
are specified as Input/Output transactions between the APM operations and 
the surrounding environment. 






John O'Donnell, Thomas Rauber, and Gudula Rünger



The APM description of the PRAM model provides three ParOps for the 
PRAM substeps and a coordination language that groups the ParOps into steps 
and composes different steps while guaranteeing the synchronization between 
them. No other ParOps are allowed in a PRAM algorithm. The specific com- 
putations within the parallel operations needed to realize a specific algorithm 
are chosen by an application programmer or an algorithm designer. These local 
functions and the PRAM ParOps constitute a complete PRAM program. 

The three ParOps of a PRAM step are READ, EXECUTE and WRITE. The
state $(s_1, \dots, s_n)$ denotes the data $s_i$ in the local memory of processor $P_i$
($i = 1, \dots, n$) involved in the PRAM program. The data from the common memory
are the input $(x_1, \dots, x_r)$ to the local computation and the data written back
to the common memory are the output $(y_1, \dots, y_r)$ of the computation.

1. In the read step, data from the common memory are provided and for each
processor $P_i$ the function $g_i$ picks appropriate values from $(x_1, \dots, x_r)$ which
are then stored by the local behavior function $f_i = \mathit{store}$ in the local memory,
producing a new local state $s'_i$. There are no internal values produced in this
step, so a dummy placeholder _ is used for the $z_i$ term. The exact behavior of
a specific READ operation is determined by the specific $g_i$ functions, which
the programmer supplies as an argument to READ. Thus the ParOp defined
here is READ $(g_1, \dots, g_n)$.

   READ $(g_1, \dots, g_n)$ $(s_1, \dots, s_n)$ $(x_1, \dots, x_r)$ $= ((s'_1, \dots, s'_n), (\,))$
   where $(s'_i, \_) = \mathit{store}(s_i, g_i(V))$
         $V = ((x_1, \dots, x_r), \_, \dots, \_)$
         $(\,) = g(V)$

2. In a local computation, each processor applies a local function $f_i$ to the
local data $s_i$ in order to produce a new state $s'_i$. The substep for the local
computation does not involve the common memory, so the input and output
vectors $x$ and $y$ are empty. In this operation, the programmer must supply
the argument function $f$, which determines the behaviors of the sites.

   EXECUTE $(f, \dots, f)$ $(s_1, \dots, s_n)$ $(\,)$ $= ((s'_1, \dots, s'_n), (\,))$
   where $(s'_i, \_) = f(s_i, \_)$
         $V = ((\,), \_, \dots, \_)$
         $(\,) = g(V)$

3. At the end of a PRAM step, data from the local states $(s_1, \dots, s_n)$ are
written back to the common memory. Each local function $f_i$ selects data $z_i$ from
its local state $s_i$. From those data $(z_1, \dots, z_n)$ the function $g$ forms the vector
$(y_1, \dots, y_r)$, which is the data available in the common memory in the next
step. If two different processors select the same variable $d$, then the
function $g$ is capable of modelling the different write strategies corresponding
to different PRAM models [Ej. The programmer specifies the local fetch
functions $f_i$ to determine the values that are extracted from the local states
in order to be sent to the common memory.

   WRITE $(f_1, \dots, f_n)$ $(s_1, \dots, s_n)$ $(\,)$ $= ((s'_1, \dots, s'_n), (y_1, \dots, y_r))$
   where $(s'_i, z_i) = f_i(s_i, \_)$
         $V = ((\,), z_1, \dots, z_n)$
         $(y_1, \dots, y_r) = g(V)$

The function $g((\,), A_1, \dots, A_n) = (A_1, \dots, A_n)$ produces an output vector
with one value from each processor in the order of the processor numbers.

The PRAM steps can be combined by a sequential coordination language 
with for-loops and conditional statements. The next subsection gives an example. 
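
The three ParOps can be mimicked by a few lines of Python. This is a minimal sketch under stated assumptions, not the APM formalism itself: local states are dictionaries, the hypothetical `store` function keeps the fetched value under the key `"x"`, the common memory is a tuple, and the sites are processed one after another although the model treats them as parallel.

```python
def store(state, value):
    """Local behaviour of READ: keep the fetched value under key 'x' (assumed layout)."""
    new_state = dict(state)
    new_state["x"] = value
    return new_state

def read(gs, states, xs):
    """READ: site i stores g_i(xs); no output is produced."""
    return [store(s, g(xs)) for s, g in zip(states, gs)], ()

def execute(f, states):
    """EXECUTE: every site applies the same local function f; no input or output."""
    return [f(s) for s in states], ()

def write(fs, g, states):
    """WRITE: site i yields (s'_i, z_i); g combines the z_i into the output vector."""
    pairs = [f(s) for f, s in zip(fs, states)]
    new_states = [s for s, _ in pairs]
    zs = [z for _, z in pairs]
    return new_states, g(zs)

# One PRAM step on three sites: fetch X_i, double it locally, write it back.
common = (10, 20, 30)
states = [{}, {}, {}]
pick = [lambda xs, i=i: xs[i] for i in range(3)]

states, _ = read(pick, states, common)
states, _ = execute(lambda s: {**s, "x": 2 * s["x"]}, states)
states, common = write([lambda s: (s, s["x"])] * 3, tuple, states)
```

Here `g` is simply `tuple`, which corresponds to the function that returns one value per processor in processor order. A different `g` could resolve write conflicts according to a chosen PRAM variant.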



4.2 Example Program: PRAM Multiprefix 

More complex operations for PRAMs can be built up from the basic PRAM
step. This section illustrates the process by defining the implementation of a
multiprefix operation on a PRAM APM that lacks a built-in multiprefix
operation. Initially, the common memory contains an input array $X$ with elements
$X_i$, for $0 \le i < n$; after the $n$ sites execute a multiprefix operation with addition
as the combining operation, the common memory holds a result array $B$, where
$B_i = \sum_{j=0}^{i} X_j$ for $0 \le i < n$.

Several notations are used to simplify the presentation of the algorithm and
to make it more readable, see Figure 4 (right). Upper case letters denote
variables in the common memory, while lower case letters are used for the local site
memories. The $f$ and $g$ functions are specified implicitly by describing abstract
PRAM operations; the low level definitions of these functions are omitted. Thus
the notation READ $b_i := X_i$ means that a PRAM READ operation is performed
with $f$ and $g$ functions defined so as to perform the parallel assignment. Similar
conventions are used for the other operations, EXECUTE and WRITE.
Furthermore, a constraint on the value of $i$ in an operation means that the operation
is performed only in those sites where the constraint is satisfied; other sites do
nothing. (Again, the constraints are implemented through the definitions of the
$f$ and $g$ functions.)

Figure 4 (left) gives the program realizing the multiprefix operation [2]. The
algorithm first copies the array $X$ into $Y$. This is done in $O(1)$ time by
performing a parallel READ that stores $X_i$ into site $i$, enabling the values to be stored in
parallel into $Y$ with a subsequent WRITE. Then a loop with $\log n$ steps is
executed. In step $j$, each processor $P_i$ (for $2^j \le i < n$) reads the value accumulated
by processor $P_{i-2^j}$, and it adds this to its local value of $b_i$. The other processors,
$P_i$ for $0 \le i < 2^j$, leave their local $b_i$ value unchanged. The resulting array is
then stored back into $Y$ in the common memory. All operations are executed in
the READ/EXECUTE/WRITE scheme required by the PRAM model. The
initialization, as well as each of the $\log n$ loop iterations, requires $O(1)$ time, so the
full algorithm has time cost $O(\log n)$.

Left: the program.

    Initially: inputs are in X_0, ..., X_{n-1}.
    Result: Y_i = sum_{j=0}^{i} X_j for 0 <= i < n

    procedure parScanl (⊕, n, X, Y)
      READ b_i := X_i for 0 <= i < n
      EXECUTE _
      WRITE Y_i := b_i for 0 <= i < n
      for j := 0 to log n - 1 do
        READ x_i := Y_{i-2^j} for 2^j <= i < n
        EXECUTE b_i := x_i + b_i for 2^j <= i < n
        WRITE Y_i := b_i for 0 <= i < n

Right: the abbreviations.

    READ b_i := X_i is an abbreviation for
      READ with
        g_i((x_0, ..., x_{n-1}), _, ..., _) = x_i
        s'_i = s_i[x_i/b_i], i = 0, ..., n-1

    EXECUTE b_i := x_i + b_i abbreviates
      EXECUTE with
        f(s_i, _) = (s'_i, _)
        s'_i = s_i[x_i + b_i / b_i]

    WRITE Y_i := b_i is an abbreviation for
      WRITE with
        f(s_i, _) = (s'_i, b_i)
        g((_), b_0, ..., b_{n-1}) = (Y_0, ..., Y_{n-1})
        with Y_i := b_i, i = 1, ..., n-1

Fig. 4. Realization of a multiprefix operation expressed as an algorithm in a PRAM-APM.
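
The parScanl procedure can be simulated sequentially to check both its result and its step count. The Python sketch below is an illustration under assumptions: the list `y` plays the role of the common memory array Y, the list `b` holds the local values b_i, and the loop runs while 2^j < n so that input lengths that are not powers of two are also handled (the figure itself iterates j from 0 to log n - 1).

```python
import operator

def par_scanl(op, xs):
    """Simulate parScanl; returns (result array, number of PRAM steps)."""
    n = len(xs)
    b = list(xs)          # READ b_i := X_i
    y = list(b)           # WRITE Y_i := b_i
    steps = 1             # the initial READ/EXECUTE/WRITE is one PRAM step
    j = 0
    while (1 << j) < n:
        d = 1 << j
        # READ: every site i >= d fetches Y[i - d]; all reads precede all writes.
        fetched = {i: y[i - d] for i in range(d, n)}
        # EXECUTE: combine the fetched value with the local value.
        for i, x in fetched.items():
            b[i] = op(x, b[i])
        # WRITE: all sites store their local value back into Y.
        y = list(b)
        steps += 1
        j += 1
    return y, steps

sums, sum_steps = par_scanl(operator.add, [1, 2, 3, 4, 5])
maxima, max_steps = par_scanl(max, [3, 1, 4, 1, 5, 9, 2, 6])
```

Keeping the READ phase separate from the EXECUTE phase matters: each site must see the value its partner held at the start of the step, which is what the synchronized READ/EXECUTE/WRITE scheme guarantees on the PRAM.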



4.3 PRAM* with Built-In Multiprefix 

The PRAM model can be enriched with a multiprefix operation (also called
parallel scan), which is considered to be one PRAM step. The steps of a program
written in the enriched model may be either multiprefix operations or ordinary
PRAM steps. The multiprefix operation with addition as a combining operation
(MPADDL) has the same behavior as the parScanl procedure in Figure 4 (left);
the only difference is that it is defined as a new PRAM ParOp, so its definition
does not use any of the other ParOps READ, EXECUTE or WRITE.

   MPADDL $(s_0, \dots, s_{n-1})$ $(X_0, \dots, X_{n-1})$ $= ((s'_0, \dots, s'_{n-1}), (Y_0, \dots, Y_{n-1}))$
   where $(s'_i, b_i) = f_i(s_i, g_i(V))$
         $V = ((X_0, \dots, X_{n-1}), b_0, \dots, b_{n-1})$

where the functions $f_0, \dots, f_{n-1}, g, g_0, \dots, g_{n-1}$ are defined as follows:

   $g_i((X_0, \dots, X_{n-1}), b_0, \dots, b_{n-1}) = (X_0, \dots, X_{i-1})$

   $f_i(s_i, (X_0, \dots, X_{i-1})) = (s'_i, b_i)$ with $b_i = \sum_{j=0}^{i} X_j$, $i = 1, \dots, n-1$

   $g((X_0, \dots, X_{n-1}), b_0, \dots, b_{n-1}) = (Y_0, \dots, Y_{n-1})$ with $Y_i := b_i$, $i = 1, \dots, n-1$

A family of related operations can be treated as PRAM primitives in the
same way. In addition to multiprefix addition from the left (MPADDL), we can
also form the sums starting from the right (MPADDR). There are
corresponding operations for other associative functions, such as maximum, for which we
can define MPMAXL and MPMAXR. Reduction (or fold) ParOps for associative
functions are also useful, such as FOLDMAX, which finds the largest of a set of
values chosen from the sites. Several of these operations will be used later in an
example (Section 5).

    Maximum Segment Sum                                        Costs

    RAM-APM                    MSS sequential                  O(n)
                               MSS sequential                  O(n log n)
    PRAM-APM                   MSS parallel with prefix        O(log n)
    PRAM-APM, built-in prefix  MSS parallel with built-in      O(1)
                               prefix

Fig. 5. Illustration of the transformation steps of the maximum segment sum.
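
The functional behaviour of the multiprefix family (MPADDL, MPADDR, MPMAXL, MPMAXR) and of FOLDMAX can be stated compactly with the Python standard library. This sketch only describes what the operations compute; the helper names `mp_left` and `mp_right` are hypothetical, and nothing here models the single-step cost that the enriched PRAM assigns to them.

```python
import operator
from itertools import accumulate

def mp_left(op, xs):
    """Left multiprefix: result[i] combines xs[0..i]."""
    return list(accumulate(xs, op))

def mp_right(op, xs):
    """Right multiprefix: result[i] combines xs[i..n-1]."""
    return list(accumulate(reversed(xs), lambda acc, x: op(x, acc)))[::-1]

def mpaddl(xs): return mp_left(operator.add, xs)
def mpaddr(xs): return mp_right(operator.add, xs)
def mpmaxl(xs): return mp_left(max, xs)
def mpmaxr(xs): return mp_right(max, xs)
def foldmax(xs): return max(xs)

left_sums = mpaddl([1, 2, 3])
right_sums = mpaddr([1, 2, 3])
right_maxima = mpmaxr([1, 5, 2])
```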

4.4 PRAM Costs 

The PRAM model defines the cost of an algorithm to be the number of PRAM
steps that have to be performed by that algorithm, i.e., a PRAM step has cost
1. Thus, the cost of the multiprefix operation for the PRAM without a built-in
multiprefix operation is the number of steps executed by the implementation
in Figure 4. In the PRAM with built-in multiprefix, the cost for a multiprefix
operation is 1. In Section 3, we have described a way to define costs starting
from the leaves of the APM hierarchy. This is also possible for the PRAM and
is useful if algorithms have to be transformed to a real machine. An example of
a real machine with built-in multiprefix is the SB-PRAM PJ, a multi-threaded
architecture supporting multiprefix operations for integers in hardware. On this
machine, each PRAM step and each multiprefix operation takes two cycles,
independently of the data values contributed by the different processors. This
can then be used as the cost measure of a leaf machine.

5 Example: MSS Algorithm Transformation 

This section illustrates how an algorithm can be transformed within the APM
framework in order to improve its efficiency. Figure 5 shows the organization of
the transformation, which moves from the sequential RAM to the parallel PRAM
models, and which includes both horizontal and vertical transformations.

We use as an example a version of the well-known maximum segment sum
(MSS) problem similar to that presented by Akl 0 . Given a sequence of numbers
$X = X_0, \dots, X_{n-1}$, the problem is to find the largest possible sum mss of a
contiguous segment within $X$. Thus we need the maximal value of $\sum_{i=u}^{v} X_i$
such that $0 \le u \le v < n$. Using prefix sums and maxima, the problem can be
solved by the following steps:

1. For $i = 0, \dots, n-1$ compute $S_i = \sum_{j=0}^{i} X_j$.

2. For $i = 0, \dots, n-1$ compute $M_i = \max_{i \le j < n} S_j$; let $a_i$ be the value of $j$ at
which $M_i$ is found.

3. For $i = 0, \dots, n-1$ compute $B_i = M_i - S_i + X_i$.

4. Compute $mss = \max_{0 \le i \le n-1} B_i$; if $u$ is the index at which the maximum
is found, the maximum sum subsequence extends from $u$ to $v = a_u$.
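
The four steps above can be traced in a short sequential Python sketch. It is meant only to make the role of S, M, a and B concrete; the function name `mss` and the tie-breaking rule (the leftmost maximum is chosen) are assumptions, and the input is assumed to be non-empty.

```python
from itertools import accumulate

def mss(xs):
    """Return (maximum segment sum, u, v) for a non-empty sequence xs."""
    n = len(xs)
    s = list(accumulate(xs))                 # step 1: prefix sums S_i
    m = [0] * n                              # step 2: suffix maxima M_i ...
    a = [0] * n                              # ... and the index a_i where M_i occurs
    best, best_j = s[n - 1], n - 1
    for i in range(n - 1, -1, -1):
        if s[i] >= best:
            best, best_j = s[i], i
        m[i], a[i] = best, best_j
    b = [m[i] - s[i] + xs[i] for i in range(n)]   # step 3: best sum starting at i
    u = max(range(n), key=b.__getitem__)          # step 4: best starting index
    return b[u], u, a[u]

value, u, v = mss([3, -4, 2, -1, 6, -3])
```

For the sample input the best segment is 2, -1, 6, which starts at index 2 and ends at index 4. The quantity B_i is the largest sum of a segment that starts at position i, since M_i - S_i is the best extension beyond X_i and adding X_i restores the first element.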

The first version of the algorithm (Figure 6, left) is written in a conventional
imperative style, except that the RAM programming model is made explicit.
The RAM has only one ParOp, an EXECUTE operation, and the RAM-APM
has only one site. The time of the algorithm is obtained by observing that the
number of EXECUTE operations performed is $O(n)$, and each takes time $O(1)$
(this result can be obtained either by analyzing the RAM APM, or by assuming
constant time for this leaf machine operation).



Algorithm 1 (left):

    Initially: inputs are in X_0, ..., X_{n-1}.
    Result: mss = max sum_{i=u}^{v} X_i for 0 <= u <= v < n

    EXECUTE a := 0
    for i := 0 to n - 1 do
      EXECUTE S_i := a + X_i
      EXECUTE a := S_i
    EXECUTE a := NEGINF
    for i := n - 1 downto 0 do
      EXECUTE M_i := max (S_i, a)
      EXECUTE a := M_i
    for i := 0 to n - 1 do
      EXECUTE B_i := M_i - S_i + X_i
    EXECUTE mss := NEGINF
    for i := 0 to n - 1 do
      EXECUTE mss := max (mss, B_i)

Algorithm 2 (right):

    procedure seqCopy (n, Y, Z)
      for i := 0 to n - 1 do
        EXECUTE Y_i := Z_i
    procedure seqScanl (f, n, A, B)
      seqCopy (n, B, A)
      for j := 0 to log n - 1 do
        for i := 2^j to n - 1 do
          EXECUTE B_i := f (B_{i-2^j}, B_i)
    procedure seqScanr (f, n, A, B)
      seqCopy (n, B, A)
      for j := 0 to log n - 1 do
        for i := n - 1 - 2^j downto 0 do
          EXECUTE B_i := f (B_i, B_{i+2^j})
    function seqFoldll (f, a, n, A)
      EXECUTE q := a
      for i := 1 to n - 1 do
        EXECUTE q := f (q, A_i)
      return q
    begin
      seqScanl (+, n, X, S)
      seqScanr (max, n, S, M)
      for i := 0 to n - 1 do
        EXECUTE B_i := M_i - S_i + X_i
      mss := seqFoldll (max, NEGINF, n, B)
    end

Fig. 6. Algorithm 1 on the left: the sequential MSS expressed within the RAM-APM
with time complexity O(n) (NEGINF is the absolute value of the largest negative number)
and Algorithm 2 on the right: the sequential MSS using a sequential multiprefix operation
expressed within the same RAM-APM with O(n log n) time.



The aim is to speed up the algorithm using a parallel scan on the PRAM
model. To prepare for this step, we first perform a horizontal transformation that
restructures the computation, making it compatible with the prefix operations.
This results in Algorithm 2 (Figure 6, right). It is interesting to observe that the
transformation has actually resulted in a slower algorithm; its benefit is to put
us into a position to perform further transformations that will more than make
up for this slow-down.

We now perform a vertical transformation from the RAM to the PRAM
models, producing Algorithm 3 (Figure 7 (left)). As usual with vertical
transformations, there is very little change to the structure of the algorithm; the main
effect of the transformation is to use the PRAM to perform the iterations without
data dependencies in $O(1)$ time. Algorithm 3 is still using the basic PRAM, so
the parallel prefix operations require $O(\log n)$ time. The final step, Algorithm 4
(Figure 7 (right)), is produced by performing a vertical transformation onto the
PRAM* model, which supports parallel prefix operations as built-in operations
with a cost of $O(1)$ time.



Algorithm 3 (left):

    procedure parScanl (f, n, A, B)
      Defined in Figure 4
    procedure parScanr (f, n, A, B)
      Similar to parScanl
    function parFold (f, n, A)
      Similar to parScanl
    begin
      parScanl (+, n, X, S)
      parScanr (max, n, S, M)
      READ m_i := M_i, s_i := S_i, x_i := X_i for 0 <= i < n
      EXECUTE b_i := m_i - s_i + x_i for 0 <= i < n
      WRITE B_i := b_i for 0 <= i < n
      mss := parFold (max, n, B)
    end

Algorithm 4 (right):

    MPADDL (n, X, S)
    MPMAXR (n, S, M)
    READ m_i := M_i, s_i := S_i, x_i := X_i for 0 <= i < n
    EXECUTE b_i := m_i - s_i + x_i for 0 <= i < n
    WRITE B_i := b_i for 0 <= i < n
    mss := FOLDMAX (n, B)

Fig. 7. Algorithm 3 on the left: MSS Parallel Scan. PRAM APM, O(log n) time. Algorithm
4 on the right: MSS Parallel Scan. PRAM* APM, O(1) time.
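
The asymptotic costs of the four algorithms can be made concrete by counting operations. The Python sketch below is a worked example under assumptions: n is a power of two, Algorithms 1 and 2 are charged one unit per EXECUTE on the RAM-APM, Algorithms 3 and 4 are charged one unit per PRAM step, and parFold is assumed to cost as much as parScanl because it is only described as similar to it.

```python
def log2_exact(n):
    """log2 of n, assuming n is a power of two."""
    return n.bit_length() - 1

def cost_alg1(n):
    # 1 + 2n (prefix sums) + 1 + 2n (suffix maxima) + n (B loop) + 1 + n (maximum)
    return 6 * n + 3

def seq_scan_cost(n):
    # seqCopy costs n; pass j of the scan performs n - 2^j EXECUTE operations
    return n + sum(n - 2 ** j for j in range(log2_exact(n)))

def cost_alg2(n):
    # two scans, the B loop (n) and the fold (1 + (n - 1))
    return 2 * seq_scan_cost(n) + n + n

def par_scan_cost(n):
    # one initial PRAM step plus one step per loop iteration
    return 1 + log2_exact(n)

def cost_alg3(n):
    # two scans, one READ/EXECUTE/WRITE step, and parFold (assumed to cost like a scan)
    return 3 * par_scan_cost(n) + 1

def cost_alg4(n):
    # MPADDL, MPMAXR, one READ/EXECUTE/WRITE step and FOLDMAX, one step each
    return 4

costs = [c(8) for c in (cost_alg1, cost_alg2, cost_alg3, cost_alg4)]
```

For n = 8 the counts show the pattern described in the text: the horizontal transformation makes the sequential algorithm slower, and the two vertical transformations then reduce the cost first to a logarithmic and then to a constant number of steps.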



6 Conclusion 

The three components of the APM methodology — an APM with its ParOps, the 
ParOp cost models and the algorithm — reflect the abstract programming model, 
the architecture of a parallel system, and the parallel program. The architecture 
is represented here very abstractly in the form of costs. 

In general, all three components of the methodology are essential. The costs
are an important guide to the design of an efficient parallel algorithm.
Separating the costs from the APMs allows several different ones to be associated
with the same operation, representing the same functionality implemented on
different machines. However, it is not enough just to keep the semantics of the
APM parallel operations and their costs: the APM definitions are still needed,
as they make explicit the organization of data and computation into sites.
Efficient algorithm design for parallel machines must consider not only when a
computation on data is performed, but also where. An example is programming
on a distributed memory machine where a poorly chosen distribution of data to
sites may cause time-consuming communication.

We therefore conclude that all three components of the methodology are 
essential, but they should be separated from each other and made distinct. For 
solving a particular problem, some parts of the structure may not be needed, 
and can be omitted. For example, a programmer may find the intuition about 
cost provided by the ParOp definitions is sufficient, in which case the separate 
cost model structure is unnecessary.

Abstract. This paper describes a new approach to the analysis of
dependencies in complex, pointer-based data structures. Structural
information is provided by the programmer in the form of two-variable finite
state automata (2FSA). Our method extracts data dependencies. For
restricted forms of recursion, the data dependencies can be exact;
however, in general, we produce approximate, yet safe (i.e. overestimating
dependencies) information. The analysis method has been automated and
results are presented in this paper.



1 Introduction 

We present a novel approach to the analysis of dependencies in complex, pointer- 
based data structures which builds on our earlier work |M2Sj. The input to our 
analysis is structural information of pointer-based data structures, and algo- 
rithms that work over those structures. We consider algorithms that update 
the data in the nodes of the structure, although we disallow structural changes 
that would result from pointer assignment. The structural specification is given 
by identifying a tree backbone and additional pointers that link nodes of the 
tree. These links are described precisely using two-variable finite state automata
(2FSA).

We can then produce dependency information for the program. This details 
for each runtime statement the set of statements that either read or write into 
the same node of the structure. Some of this information may be approximate, 
but we can check that it is conservative in that the correct set of dependencies 
will be a subset of the information we produce. For even quite small sections of 
program the output may be dauntingly complex, but we explore techniques for 
reducing it to a tractable size and extracting useful information. 

The paper is organised as follows: we outline the notion of a 2FSA description
in Section 2 and describe the restricted language that we use in Section 2.1. In
Section 3.1 we look at an example of a recursive rectangular mesh, and follow
through with the description and analysis of a simple piece of program. We deal
with a more complex example in Section El We describe related works in Section 
El with conclusions and plans for future work in Section El 

S.P. Midkiff et al. (Eds.): LCPC 2000, LNCS 2017, pp. 304-1223 2001. 

(c) Springer-Verlag Berlin Heidelberg 2001

2 Structure Descriptions Using 2FSA

We first observe how dynamic data structures are handled in a language such
as C, and then relate this to our approach. Consider the following example of a
tree data structure:

struct Tree {
    int data;
    Tree * d1;
    Tree * d2;
    Tree * d3;
    Tree * d4;
    Tree * r;
};


The items data, d1, d2, d3, d4 and r are the fields of the structure, and may
contain items of data (such as data) or pointers to other parts of the structure.
We assume here that d1, d2, d3, d4 point to four disjoint subtrees, and the r
pointer links nodes together across the structure.

We next explain how this structure is represented. We have a fixed list of
symbols, the alphabet $A$, that corresponds to the fields in the structure. We
define a subset, $G \subset A$, of generators. These pointers form a tree backbone for
this structure, with each node in the structure being identified by a unique string
of symbols, called the pathname (a member of the set $G^*$), which is the path
from the root of the tree to that particular node. Therefore d1, d2, d3 and d4
are the generators in the example.

Our description of the structure also contains a set of relations, $\rho_i \subset G^* \times G^*$,
one for each non-generator or link field $i$. This relation links nodes that are joined
by a particular pointer. A node may be joined to more than one target node via
a particular link. This allows approximate information to be represented. It is
useful to consider each relation as a function from pathnames to the power set
of pathnames, $G^* \to \mathcal{P}(G^*)$: each pathname maps to a set of pathnames
that it may link to. In our example r is such a link. A word is a string of fields;
we append words to pathnames to produce new ones.

We represent each relation as a two-variable finite state automaton (2FSA). 
These are also known as Left Synchronous Transducers (see for example pM]). 
We recall that a (deterministic) Finite State Automaton (FSA) reads a string of 
symbols, one at a time, and moves from one state to another. The automaton 
consists of a finite set of states $S$, and a transition function $F : S \times A \to S$,
which gives the next state for each of the possible input symbols in $A$. The string
is accepted if the automaton is in one of the accept states when the string is exhausted.
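
A deterministic FSA of this kind takes only a few lines of Python. The sketch below is illustrative: the transition table, the choice of start state 0 and the example language (pathnames whose last symbol is d2) are assumptions made for the example, and a missing table entry is treated as rejection.

```python
GENERATORS = ("d1", "d2", "d3", "d4")

# State 1 means "the last symbol read was d2"; state 0 covers everything else.
TRANSITIONS = {
    (state, sym): (1 if sym == "d2" else 0)
    for state in (0, 1)
    for sym in GENERATORS
}

def accepts(transitions, start, accepting, word):
    """Run the automaton over `word`; accept if it stops in an accept state."""
    state = start
    for sym in word:
        state = transitions.get((state, sym))
        if state is None:          # no transition defined: reject
            return False
    return state in accepting

ends_in_d2 = accepts(TRANSITIONS, 0, {1}, ["d1", "d2"])
ends_in_d3 = accepts(TRANSITIONS, 0, {1}, ["d2", "d3"])
```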

A two-variable FSA attempts to accept a pair of strings and inspects a symbol
from each of them, at each transition. It can be thought of as a one-variable FSA,
but with the set of symbols extended to $A \times A$. There is one subtlety, in that we
may wish to accept strings of unequal lengths, in which case the shorter one is
padded with the additional '−' symbol. The actual set of symbols is therefore
$((A \cup \{-\}) \times (A \cup \{-\})) \setminus (-, -)$, since the double padding symbol is never
needed.
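
The padding convention can be seen in a small Python sketch of a two-variable automaton. The example relation is an assumption chosen to exercise the padding symbol: it accepts exactly the pairs (x, x.d2), that is, a pathname together with the pathname of its d2 child. The names `PAD`, `CHILD_D2` and `accepts_pair` are hypothetical.

```python
from itertools import zip_longest

PAD = "-"
GENERATORS = ("d1", "d2", "d3", "d4")

# State 0: both strings have agreed so far. State 1: the left string has ended
# and the right string has contributed one extra d2.
CHILD_D2 = {(0, (g, g)): 0 for g in GENERATORS}
CHILD_D2[(0, (PAD, "d2"))] = 1

def accepts_pair(transitions, start, accepting, left, right):
    """Read one symbol from each string per transition, padding the shorter one."""
    state = start
    for pair in zip_longest(left, right, fillvalue=PAD):
        state = transitions.get((state, pair))
        if state is None:
            return False
    return state in accepting

is_child = accepts_pair(CHILD_D2, 0, {1}, ["d1", "d3"], ["d1", "d3", "d2"])
same_node = accepts_pair(CHILD_D2, 0, {1}, ["d1"], ["d1"])
```

Because `zip_longest` stops when both strings are exhausted, the pair (PAD, PAD) never occurs, which matches the remark that the double padding symbol is not needed.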

We can utilise non-deterministic versions of these automata. As already de- 
scribed, deterministic 2FSAs allow only one transition from a fixed state with 
a fixed symbol; non-deterministic ones relax this condition by allowing many. 
Most manipulations work with deterministic 2FSAs, but on occasions it may 
be simpler to define a non-deterministic 2FSA with a particular property, and 
determinise it afterwards. 

We also use 2FSAs to hold and manipulate the dependency information that 
we gather. There are other useful manipulations of these 2FSAs which we use in 
our analysis. 

— Logical Operations. We can perform basic logical operations such as AND
($\wedge$), OR ($\vee$), and NOT ($\neg$) on these 2FSAs.

— The Exists and Forall automata. The one-variable FSA $\exists(F)$ accepts $x$ if
there exists a $y$ such that $(x, y) \in F$. Related to this is the 2FSA $\forall(R)$, built
from the one-variable FSA $R$, that accepts $(x, y)$ for all $y$, if $R$ accepts $x$.

— Composition. Given 2FSAs for individual fields, we wish to combine the
multiplier 2FSAs into one multiplier for each word of fields that appear in
the code. For instance, we may wish to find those parts of the structure
which are accessed by the word a.b given the appropriate 2FSAs for a and
b. This composition 2FSA can be computed: given two 2FSAs, $R$ and $S$, their
composition is denoted by $R.S$. This is defined as: $(x, y) \in R.S$ if there
exists a $z$ such that $(x, z) \in R$ and $(z, y) \in S$. See [RCH+92] for the details
of its construction.

— The inverse of 2FSA $F$, denoted $F^{-1}$, which is built by swapping the pair
of letters in each transition of $F$.

— The closure of 2FSA F, denoted F*, which we discuss later in Section FOl 

With the exception of the closure operation, these manipulations are exact 
for any 2FSAs. 
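
The meaning of these operations is easiest to see on finite relations. The Python sketch below is a deliberate simplification and an assumption of this illustration: it represents a relation as an explicit set of (pathname, pathname) pairs rather than as an automaton, so it shows what composition, inverse and Exists compute, not how the 2FSA constructions work.

```python
def compose(r, s):
    """(x, y) is in r.s if some z has (x, z) in r and (z, y) in s."""
    return {(x, y) for (x, z1) in r for (z2, y) in s if z1 == z2}

def inverse(r):
    """Swap the two components of every pair."""
    return {(y, x) for (x, y) in r}

def exists(r):
    """All x for which some y has (x, y) in r."""
    return {x for (x, _) in r}

# Hypothetical fragments of two link relations on single-symbol pathnames.
right = {("d1", "d2"), ("d3", "d4")}
down = {("d1", "d3"), ("d2", "d4")}

right_then_down = compose(right, down)
left = inverse(right)
has_right_neighbour = exists(right)
```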

2.1 Program Model 

We work with a fairly restricted programming language with C-like syntax that 
manipulates these structures. The program operates on one global data struc- 
ture. Data elements are accessed via pathnames, which are used in the same 
manner as conventional pointers. The program consists of a number of possibly 
mutually recursive functions. These functions take any number of pathname pa- 
rameters, and return void. It is assumed that the first (main) function is called 
with the root path name. Each function may make possibly recursive calls to 
other functions using the syntax ‘Func(w->g)’ where g is any field name. 

The basic statements of the program are reads and writes to parts of the 
structure. A typical read/write statement is ‘w->a = w->b’ where w is a variable 
and a and b are words of directions, and denotes the copying of an item of data 
from w->b to w->a within the structure. Note that we do not allow structures
to be changed dynamically by pointer assignment. This makes our analysis only
valid for sections of algorithms where the data in an existing structure is being 
updated, without structural changes taking place. 



3 The Analysis 

3.1 A Rectangular Mesh Structure 

We work through the example of a rectangular mesh structure, as shown in
Figure 1, to demonstrate how a 2FSA description is built up from a specification.
These mesh structures are often used in finite-element analysis; for example, to
analyse fluid flow within an area. The area is recursively divided into rectangles,
and each rectangle is possibly split into four sub-rectangles. This allows for a
greater resolution in some parts of the mesh. We can imagine each rectangle
being represented as a node in a tree structure. Such a variable resolution mesh
results in an unbalanced tree, as shown in Fig. 1. Each node may be split further
into four subnodes. Each rectangle in the mesh has four adjacent rectangles that
meet along an edge in that level of the tree. For example, the rectangle 4 on the
bottom right has rectangle 3 to the left of it and rectangle 2 above. This is also
true for the smallest rectangle 4, except that it has a rectangle 3 to its right. We
call these four directions l, r, d and u. The tree backbone of the structure has
four generators d1, d2, d3 and d4 and linking pointers l, r, d and u. We assume
that these links join nodes at the same level of the tree, where these are available,
or to the parent of that node if the mesh is at a lower resolution at that point.
So moving in direction u from node d3 takes us to node d1, but going up from
node d3.d1 takes us to node d1, since node d1.d3 does not exist.

We next distill this information into a set of equations that hold this linkage 
information. We will then convert these equations into 2FSA descriptions of the 
structure. Although they will be needed during analysis, the generator descrip- 
tions do not have to be described by the programmer as they are simple enough 
to be generated automatically. 

Let us now consider the r direction. Going right from a dl node takes us 
to the sibling node d2. Similarly,we reach the d4 node from every d3. To go 
right from any d2 node, we first go up to the parent, then to the right of that 
parent, and down to the dl child. If that child node does not exist then we link 
to the parent. For the d4 node, we go to the d3 child to the right of the parent. 
This information can be represented for the r direction by the following set of 
equations (where x is any pathname) : 



{x.dl) = x.d2 


( 1 ) 


{x.d2) = r{x).dl\r{x) 


(2) 


(x.d3) = x.dA 


( 3 ) 


(x.d4) = r{x).dS\r{x) 


( 4 ) 
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d3 . d4 . dl d3 . d4 . d2 



Fig. 1. Above: the variable resolution rectangular mesh. Below: The mesh is 
represented as a tree structure with the l,r,d,u links to adjacent rectangles. 
(Not all the links are shown, for the sake of brevity.) 
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d{x.dl) = x.d?) 
d{x.d2) = a:.ci4 
d{x.dS) — d(a:).cil|d(a:) 
d{x.d4) = d(a:).d2|d(a:) 

u{x.dl) = u{x).d3\u{x) 
u{x.d2) — u{x).d4\u{x) 
u{x.d3) = x.dl 
u{x.d4) = x.d2 



l{x.dl) — l{x).d2\l{x) 
l{x.d2) = x.dl 
l{x.d3) — l{x).d4\l{x) 
l{x.d4) = x.d3 




Fig. 2. Equations and 2FSA descriptions for the link directions in the mesh 
structure. 
These equations are next converted into a 2FSA for the r direction. Note 
that one pathname is converted into another by working from the end of the 
pathname back to the beginning. The rule ‘r(x.d4) = r(x).d3 | r(x)’ does the 
following: if d4 is the last symbol in the string, replace it with a d3 or an ε, then 
apply the r direction to the remainder of the string. This is similarly true for 
the ‘r(x.d2) = r(x).d1 | r(x)’ rule. In the rule ‘r(x.d1) = x.d2’ the last d1 symbol 
is replaced by a d2, and the remainder of the string is output unchanged. 
Viewed in this way, we can create a 2FSA that accepts the paths, but in reverse 
order. 

The correct 2FSA is produced by reversing the transitions in the automaton 
and exchanging initial and accept states. Note that the presence of transitions 
labelled with ε may make this reversal impossible. We can, however, use this 
process to produce 2FSA descriptions for the other link directions l, d and u. 
Figure 2 illustrates the remaining set of equations and 2FSA descriptions. 
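
The link equations for the r direction can also be read as a rewriting procedure on pathnames that works from the last symbol backwards. The following is only a minimal sketch of that reading, not the 2FSA construction itself: the function name step_right and the encoding of d1..d4 as the integers 1..4 are our own assumptions, and we assume every child exists, so the ‘| r(x)’ alternative of equations (2) and (4) is never taken.

```c
#include <assert.h>
#include <stddef.h>

/* Quadrant symbols d1..d4 are stored as the integers 1..4 (an assumption
   made for this sketch only). */

/* Writes r(in) to out and returns 1, or returns 0 when the walk runs off
   the right-hand edge of the mesh. Assumes every child exists, so the
   "| r(x)" fallback in equations (2) and (4) is never needed. */
int step_right(const int *in, size_t len, int *out) {
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < len; i++) out[i] = in[i];
    /* Work from the last symbol back towards the root. */
    for (size_t i = len; i-- > 0; ) {
        switch (out[i]) {
        case 1: out[i] = 2; return 1;   /* r(x.d1) = x.d2 */
        case 3: out[i] = 4; return 1;   /* r(x.d3) = x.d4 */
        case 2: out[i] = 1; break;      /* r(x.d2) = r(x).d1, continue on x */
        case 4: out[i] = 3; break;      /* r(x.d4) = r(x).d3, continue on x */
        default: return 0;              /* not a generator symbol */
        }
    }
    return 0; /* carried past the root: no neighbour to the right */
}
```

For instance, under these assumptions the pathname d1.d2 is rewritten to d2.d1, while d2.d2 has no right neighbour because it lies on the right-hand edge.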

3.2 The Program 



main(Tree *root) {
    if (root != NULL) {
A:      traverse(root->d2);
B:      traverse(root->d4);
    }
}

traverse(Tree *t) {
    if (t != NULL) {
C:      sweepl(t);
D:      traverse(t->d1);
E:      traverse(t->d2);
F:      traverse(t->d3);
G:      traverse(t->d4);
    }
}

sweepl(Tree *x) {
    if (x != NULL) {
H:      x->l->data = x->data + x->r->data;
I:      sweepl(x->l);
    }
}

Fig. 3. Sample recursive functions. The A,B,C,D,E,F,G,H,I are statement 
labels and are not part of the original code. 



The recursive functions in Fig. 3 operate over the rectangular mesh structure 
illustrated in Fig. 1. The function main branches down the right hand side of 
the tree calling traverse to traverse the whole sub-trees. The function sweepl 
then propagates data values out along the l field. 

The descriptions for single fields can be used to build composite fields. For 
instance, the update in statement H requires the composite fields r->data and 
l->data, both built up from the basic descriptions by composing the 2FSAs. 
Each runtime statement is uniquely identified by a control word, defined by 
labelling each source code line with a symbol, and forming a word by appending 
the symbol for each recursive function that has been called. Each time the 
traverse function is called recursively from statement D, we append a D symbol 
to the control word. Source code line H therefore expands to the set of runtime 
control words (A|B).(D|E|F|G)*.C.I*.H. 

A pathname parameter to a recursive function is called an induction 
parameter. Each time we call the function recursively from statement D, we append 
a d1 field to the induction parameter t and for statement E we similarly append 
a d2 field. The same is true for statements F and G. This information can be 
captured in a 2FSA that converts a given control word into the value of the 
parameter t, for that function call. The resulting 2FSA is shown in Fig. 4. 









[Figure: a 2FSA whose transitions are labelled (A,d2), (B,d4), (D,d1), (E,d2), 
(F,d3) and (G,d4).] 

Fig. 4. A 2FSA that maps control words to values of t. 
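
The mapping from control words to values of t can be illustrated for the example program by a direct scan of the control word. This is a sketch under our own assumptions (the function name control_word_to_path, control words as C strings of statement labels, and d1..d4 encoded as 1..4); it mirrors only the transitions for labels A, B and D to G, and is not the general construction.

```c
#include <assert.h>
#include <stddef.h>

/* Maps a control word over the labels of the example program to the
   pathname held in t, written as integers 1..4 for d1..d4.
   Returns the path length, or -1 if the word does not start with A or B
   or if the output buffer is too small. */
int control_word_to_path(const char *word, int *path, size_t cap) {
    assert(word != NULL && path != NULL);
    size_t n = 0;
    if (cap == 0) return -1;
    /* main passes root->d2 (label A) or root->d4 (label B) to traverse. */
    if (word[0] == 'A') path[n++] = 2;
    else if (word[0] == 'B') path[n++] = 4;
    else return -1;
    for (size_t i = 1; word[i] != '\0'; i++) {
        int field;
        switch (word[i]) {
        case 'D': field = 1; break;
        case 'E': field = 2; break;
        case 'F': field = 3; break;
        case 'G': field = 4; break;
        default: return (int)n;   /* any other label ends the scan */
        }
        if (n == cap) return -1;
        path[n++] = field;
    }
    return (int)n;
}
```

With this encoding, the control word A.D.E.C yields the pathname d2.d1.d2, which is the node that sweepl receives from statement C.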



3.3 Building a General Induction Parameter 2FSA 

In general, for each induction parameter $V_i$ in each function, we need to create a 
2FSA, $F_{V_i}$, that maps all control words to the pathname value of that parameter 
at that point in the execution of the program. Fig. 5 outlines the construction 
of such a 2FSA. 

If a function is called in the form F(..., Vk, ...) where Vk is an induction 
parameter, then the substitution 2FSA will contain transition symbols of the 
form (A, ε). We aim to remove these by repeated composition with a 2FSA that 
removes one ε at a time from the output. Provided that the transition is not 
in a recursive loop, i.e., we can bound the number of times that the function 
can be called in any run of the program, applying this repeatedly will remove 
all such transitions. The construction of this epsilon-removal 2FSA is outlined 
in Fig. 6. 

Since the program in Fig. 3 has a function sweepl that recurses by updating 
its induction parameter by a non-generator direction l, we have to approximate 
the dependency information for this function. This is because, in general, the 
required induction parameter relation will not necessarily be a 2FSA; hence we 
approximate with one. 



— Create a non-deterministic 2FSA with a state for each induction variable, plus 
an additional initial state. State i + 1 (corresponding to variable i) is the only 
accepting state. 

— For each pathname parameter j of the main function, add an epsilon transition 
from state 1 to state j + 1. 

— For each induction variable k, we seek all call statements of the form A: 
F(..., Vk->g, ...). If Vk->g is passed as variable m in function F, then we add 
a transition from state k + 1 to state m + 1, with symbol (A,g). Here we assume 
that g is a generator. If g is empty, we use an epsilon symbol in its place, then 
attempt to remove it later. If g is a non-generator we still use the symbol (A,g), 
but we need to apply the closure approximation described later. 

— Determinise the non-deterministic 2FSA. 

Fig. 5. The Substitution 2FSA for the variable Vi 



— Create a 2FSA with one state for each generator (and link), plus another three 
states: an initial state, an accepting final state and an epsilon state. 

— Add transitions from the initial state: (d, d) symbols, where d is a field 
(generator or link), make the transition back to the initial state; (ε, d) symbols 
move to the state for that field (d + 1); (ε, -) symbols move to the final state. 

— Add transitions for each field state (field i corresponds to state i + 1): symbol 
(i, i) leaves it at state i + 1; (i, d) moves it to state d + 1; (i, -) moves it to the 
final state. 

— Add transitions for the epsilon state: (ε, d) moves to d + 1; (ε, ε) leaves it at the 
epsilon state; (ε, -) moves it to the final state. 

— There are no transitions from the final state. 

Fig. 6. The remove-ε 2FSA 



3.4 Formation of Approximate Closures of 2FSA 

We next consider the problem of producing a 2FSA that safely approximates 
access information of functions that recurse in the non-generator fields. This 
sort of approximation will give coarse-grained access information for a whole 
nest of recursive calls. 

We first take the simplest example of a function that calls itself recursively, 
which appends a non-generator field to the induction parameter at each call. We 
then show how this can be used to produce approximate information for any 
nest of recursive calls. Consider the following program fragment: 

sweepl(Tree *x) {
    if (x != NULL) {
H:      x->l->data = x->data + x->r->data;
I:      sweepl(x->l);
    }
}



Recursion in one field can be approximated by a number of steps in that 
field with a 2FSA that approximates any number of recursions. In the second 
iteration of the function, the value of x will be l appended to its initial value. 
The value in iteration k can be written as a repeated composition of l, and can 
be readily computed for any finite value of k, although the complexity of the 
2FSA becomes unmanageable for a large k. We wish to approximate all possible 
combinations in one 2FSA. 

Definition 1. The relation [=] is the equality relation: $(x,y) \in [=] \iff x = y$. 

Definition 2. The closure of the field p, written $p^*$, is defined as 

$$p^* = \bigvee_{k=0}^{\infty} p^k$$

where $p^0$ is defined as [=]. 
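
On a finite structure the closure can be computed directly by accumulating the powers of p. The sketch below is only an illustration of Definition 2, not the automaton-based method of this paper: it assumes a hypothetical structure of N nodes, represents a relation as an N×N boolean matrix, and reads composition as "apply the left relation first".

```c
#include <assert.h>
#include <string.h>

#define N 4  /* hypothetical number of nodes in the finite example */

/* Computes the closure of p into out by accumulating p^0 = [=], p^1, ...
   On N nodes, powers beyond p^(N-1) add no new pairs. */
void closure(unsigned char p[N][N], unsigned char out[N][N]) {
    unsigned char power[N][N], next[N][N];
    assert(p != NULL && out != NULL);
    memset(power, 0, sizeof power);
    memset(out, 0, sizeof power);
    for (int i = 0; i < N; i++) power[i][i] = 1;       /* p^0 = [=] */
    for (int k = 0; k < N; k++) {
        for (int x = 0; x < N; x++)
            for (int y = 0; y < N; y++)
                out[x][y] |= power[x][y];               /* add p^k */
        memset(next, 0, sizeof next);
        for (int x = 0; x < N; x++)
            for (int z = 0; z < N; z++)
                if (power[x][z])
                    for (int y = 0; y < N; y++)
                        next[x][y] |= p[z][y];          /* p^(k+1) = p^k . p */
        memcpy(power, next, sizeof power);
    }
}
```

For a chain of four nodes in which the l link of node i leads to node i-1, the computed closure relates each node to itself and to every node to its left.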

In the example of the l field, the closure can be represented as a 2FSA, and 
the approximation is therefore exact. In general, however, this is not always the 
case, but we aim to approximate it safely as one. A safe approximation S to a 
relation R is one for which (x, y) ∈ R implies (x, y) ∈ S. 

We have developed a test to demonstrate that a given 2FSA R is a safe 
approximation to a particular closure: we use a heuristic to produce R, and then 
check that it is safe. 

Theorem 1. If R and p are relations such that $R \supseteq R.p \vee [=]$, then R is a safe 
approximation to $p^*$. 

Proof. Firstly, $R \supseteq R.p \vee [=] \Rightarrow R \supseteq R.p^{k+1} \vee \bigvee_{i=0}^{k} p^i$ for all k (proof: 
induction on k). If $(x,y) \in p^*$, then $(x,y) \in p^r$ for some integer r. Applying 
the above with k = r implies that $(x,y) \in R$. 
Therefore, given a 2FSA, R, that we suspect might be a safe approximation, 
we can test for this by checking if $R \supseteq R.p \vee [=]$. This is done by forming 
$R \wedge (R.p \vee [=])$, and testing equality with $R.p \vee [=]$. It is worth noting that 
safety does not always imply that the approximation is useful; a relation that 
accepts every pair (x, y) is safe but will probably be a poor approximation. 
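
The premise of Theorem 1 is easy to check on a finite structure. The following sketch makes the same assumptions as a small hand example would: a hypothetical structure of N nodes, relations stored as N×N boolean matrices, and composition read as "apply the left relation first". It illustrates the test only and is not the 2FSA implementation.

```c
#include <assert.h>
#include <stddef.h>

#define N 4  /* hypothetical number of nodes in the finite example */

/* Returns 1 if r contains (r.p OR [=]), the premise of Theorem 1,
   and 0 otherwise. */
int is_safe(unsigned char r[N][N], unsigned char p[N][N]) {
    assert(r != NULL && p != NULL);
    for (int x = 0; x < N; x++) {
        if (!r[x][x]) return 0;                  /* [=] must lie inside r */
        for (int z = 0; z < N; z++) {
            if (!r[x][z]) continue;
            for (int y = 0; y < N; y++)
                if (p[z][y] && !r[x][y])
                    return 0;                    /* (x,y) in r.p but not in r */
        }
    }
    return 1;
}
```

On a chain where the link of node i leads to node i-1, the exact closure passes the test, the relation accepting every pair passes as well (safe but uninformative), and the identity relation alone fails.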

To prove an approximation exact we need equality in the above comparison 
and an additional property: that for every node x, travelling in the direction of 
p, we will always reach the edge of the structure in a finite number of moves. 

Theorem 2. If R and p are relations such that $R = R.p \vee [=]$, and for each 
x there exists a $k_x$ such that for all y, $(x,y) \notin R.p^{k_x}$, then $R = p^*$, and the 
approximation is exact. 

Proof. If $(x,y) \in R$, then $(x,y) \in R.p^{k_x} \vee \bigvee_{i=0}^{k_x-1} p^i$. Since (x,y) cannot be in 
$R.p^{k_x}$, it must be in $p^i$ for some $i < k_x$. So $(x,y) \in p^*$. 

Unfortunately we cannot always use this theorem to ascertain that an approx- 
imation is exact, as we know of no method to verify that the required property 
holds for an arbitrary relation. However, this property will often hold for a link 
field (it does for most of the examples considered here), so an approximation can 
be verified for exactness. 

The method that we use to generate these closures creates a (potentially 
infinite) automaton. In practice we use a number of techniques to create a small 
subset of this state space: 

— Identifying fail states. 

— ‘Folding’ a set of states into one approximating state. The failure transitions 
for a state are the set of symbols that lead to the failure state from that state. 
If two states have the same set of failure transitions, we assume they are the 
same state. 

— Changing transitions out of the subset so that they map to states in the 
subset. 

We have tested the closures of around 20 2FSAs describing various structures. 
All have produced safe approximations, and in many cases exact ones. 

3.5 Generalising to Any Loop 

We can use these closure approximations in any section of the function call graph 
where we have a recursive loop containing a link field in addition to other fields. 
Starting from the 2FSA for the value of an induction variable v, we build up 
an expression for the possible values of v using Arden’s rules for converting an 
automaton to a regular expression. This allows us to solve a system of i equations 
for regular expressions $E_i$. In particular, we can find the solution of a recursive 
equation $E_1 = E_1.E_2 \mid E_3$ as 

$E_1 = E_3.(E_2)^*$

The system of equations is solved by repeated substitution and elimination of 
recursive equations using the above formula. We thus obtain a regular expression 
for the values of v within the loop. We can then compute an approximation for 
this operation using OR, composition and closure manipulations of 2FSAs. 



3.6 Definition and Use 2FSAs 

For each read statement, X, that accesses p->w, we can append to $F_p$ the 
statement symbol, X, and the read word, w, for that statement. Thus if $F_p$ accepts 
(C, y), then this new 2FSA will accept (C.X, y.w), and is formed from two 
compositions. The conjunction of this set of 2FSAs produces another 2FSA, $F_r$, 
that maps from any control word to all the nodes of the structure that can be 
read by that statement. Similarly, we can produce a 2FSA that maps from 
control words to nodes which are written, denoted by $F_w$. These write and read 
2FSAs are derived automatically by the dependency analysis: 

1. The definition 2FSA, that accepts the pair (control word, path-name) if the 
control word writes (or defines) that node of the structure. 

2. The use 2FSA for describing nodes that are read by a particular statement. 



We can now describe all the nodes which are read from and written to by 
any statement. By combining the read and write 2FSAs we create 2FSAs that 
link statements when they conflict, one for each of the possible types of 
dependency (read after write, write after read and write after write). Combining 
these 2FSAs forms the conflict 2FSA: 

$F_{conf} = F_r.(F_w)^{-1} \cup F_w.(F_r)^{-1} \cup F_w.(F_w)^{-1}$

We may be interested in producing dataflow information, i.e. associating 
with each read statement the write statement that produced that value. We 
define the causal 2FSA, $F_{causal}$, which accepts control words X, Y only if Y 
occurs before X in the sequential running of the code. The 2FSA in Fig. 7 is 
the conflict 2FSA for the example program, which has also been ANDed with 
$F_{causal}$ to remove any false sources that occur after the statement. 
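
For a finite trace, the combination of conflict and causal information can be sketched without automata. In the illustration below the names Access and conflicts_with_earlier are our own, each runtime statement carries its read and write sets as bitmasks over hypothetical node numbers, and statements are indexed in sequential execution order, so the causal filter reduces to an index comparison.

```c
#include <assert.h>
#include <stddef.h>

/* One runtime statement: bit n of reads/writes is set when the statement
   reads/writes node n. Statements are stored in sequential order. */
typedef struct { unsigned reads, writes; } Access;

/* Returns 1 if statement x conflicts with an earlier statement y
   (read after write, write after read, or write after write). */
int conflicts_with_earlier(const Access *s, size_t x, size_t y) {
    assert(s != NULL);
    if (y >= x) return 0;                        /* causal filter: y before x */
    return (s[x].reads  & s[y].writes) != 0      /* read after write  */
        || (s[x].writes & s[y].reads)  != 0      /* write after read  */
        || (s[x].writes & s[y].writes) != 0;     /* write after write */
}
```

In a trace where statement 0 writes node 1 and statement 1 reads node 1, statement 1 conflicts with statement 0 but not the other way round, since sources that occur later are filtered out.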



3.7 Partitioning the Computation 

We now consider what useful information can be gleaned from the conflict graph 
with regard to the parallel execution of the program. Our approach is as follows: 

Fig. 7. The conflict 2FSA for the example code. 

Fig. 8. The values read and written by each of the threads. 



— Define execution threads by splitting the set of control words into separate 
partitions. Each partition of statements will execute as a separate thread. 
At present these are regular expressions supplied by the programmer. 

— Use the conflict 2FSA to compute the dependencies between threads. 

We apply this approach to the running example. We can take as our two 
partitions of control words: 

1. (A).(D|E|F|G)*.C.I*.H 

2. (B).(D|E|F|G)*.C.I*.H 

We can then compute the total set of values read and written by each 
statement in each thread. This is shown in Fig. 8. Since there is no overlap, the 
two threads are completely independent and can therefore be spawned safely. 
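
The independence condition used here can be stated compactly for finite access summaries. The sketch below assumes each thread is summarised by two bitmasks over hypothetical node numbers; the names ThreadAccess and threads_independent are ours. It treats two threads as independent when neither writes a node that the other reads or writes, so shared reads are allowed.

```c
#include <assert.h>
#include <stddef.h>

/* Summary of one thread: bit n is set when some statement of the thread
   reads (or writes) node n. */
typedef struct { unsigned long reads, writes; } ThreadAccess;

/* Returns 1 when neither thread writes a node that the other touches. */
int threads_independent(const ThreadAccess *a, const ThreadAccess *b) {
    assert(a != NULL && b != NULL);
    unsigned long a_all = a->reads | a->writes;
    unsigned long b_all = b->reads | b->writes;
    return (a->writes & b_all) == 0 && (b->writes & a_all) == 0;
}
```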

4 A Second Example 

We next describe how this approach can be applied to a larger program, as shown 
in Fig. 9. Consider the function calls in statements A and B being spawned as 
separate threads. The resulting conflict 2FSA for this example has 134 nodes, too 
large to interpret by manual inspection. We describe an approach for extracting 
precise information that can be used to aid the parallelisation process. 
main(Tree *node) {
    if (node != NULL) {
A:      update(node);
B:      main(node->d1);
C:      main(node->d2);
    }
}

update(Tree *w) {
    if (w != NULL) {
D:      sweepl(w);
E:      propagate(w);
    }
}

propagate(Tree *p) {
    if (p != NULL) {
F:      p->data = p->l->data + p->r->data
                + p->u->data + p->d->data;
G:      propagate(p->d1);
H:      propagate(p->d2);
I:      propagate(p->d3);
J:      propagate(p->d4);
    }
}

sweepl(Tree *x) {
    if (x != NULL) {
K:      x->l->data = x->data;
L:      sweepl(x->l);
    }
}



Fig. 9. The second example program 



A particular statement, X say, in the second thread may conflict with many 
statements in the first thread. Delaying the execution of X until all the conflicting 
statements have executed ensures that the computation is carried out in the 
correct sequence. We can therefore compute information that links a statement 
to the ones whose execution it must wait for. 

We map X to Y, such that if X conflicts with Z, then Z must be executed 
earlier than Y. We produce the wait-for 2FSA by manipulating the conflict and 
causal 2FSAs: 



$F_{wait-for} = F_{conf} \wedge \neg(F_{conf}.F_{causal})$
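
The intent of the wait-for information can be illustrated on a finite trace: a statement only needs to wait for the latest of the earlier statements it conflicts with, since every other conflicting statement has executed by then. The sketch below rests on our own assumptions (statements numbered in sequential execution order, a flattened n×n conflict matrix, and the hypothetical name wait_for); it does not reproduce the 2FSA manipulation.

```c
#include <assert.h>
#include <stddef.h>

/* conflict[x * n + z] is nonzero when statement x conflicts with
   statement z; statements are numbered in sequential execution order.
   Returns the index of the latest earlier statement that x conflicts
   with, or -1 if x has no earlier conflict and need not wait. */
long wait_for(const unsigned char *conflict, size_t n, size_t x) {
    assert(conflict != NULL && x < n);
    for (size_t z = x; z-- > 0; )
        if (conflict[x * n + z]) return (long)z;
    return -1;
}
```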



We use this approach to trim the conflict information and produce the one in 
Fig. 10. We are interested in clashes between the two threads, so only the portion 
that begins with the (B,A) symbol is relevant. We extract the information that 
the statements given by B.A.D.L.(L)*.K must wait for A.E.G.F to execute. 
This implies that to ensure the threads execute correctly we would need to insert 
code to block these statements from executing until the statement A.E.G.F has 
done so. 

Fig. 10. The ‘waits-for’ information 



4.1 Implementation 

We have implemented the techniques described in this paper in C++ using the 
2FSA manipulation routines contained in the ‘kbmag’ library [Hol]. In addition 
we use the graph visualisation application ‘daVinci’ ESI for viewing and 
printing our 2FSA diagrams. We ran our analysis codes on a 400MHz Pentium II 
GNU/Linux machine. The smaller example took 283 seconds, and the larger one 
497 seconds. Much of this time is spent in the closure computation routines, and, 
in the case of the larger example, in the production of the waits-for information 
from the large conflict graph. 

5 Related Work 

The ASAP approach [HHN94b] uses three different types of axiom to store 
information about the linkages in the structure. 

1) $\forall p,\; p.RE1 \neq p.RE2$ 

2) $\forall p \neq q,\; p.RE1 \neq q.RE2$ 

3) $\forall p,\; p.RE1 = p.RE2$ 

where p, q are any path names and RE1, RE2 are regular expressions over the 
alphabet of fields. These axioms are then used to prove whether two pointers can 
be aliased or not. In comparison, our properties are of the form p.RE1.n = p.RE2, 
where RE1 and RE2 are regular expressions in the generator directions and n is 
a link. They are slightly more powerful in that the 2FSA description allows RE1 
and RE2 to be dependent. Provided that the regular expressions do not include 
link directions, we can express ASAP axioms as 2FSAs and combine them into 
our analysis. In fact, even if the expression has link directions inside a Kleene star 
component, we can use the closure approximation method to approximate this 
information and use it in our analysis. ASAP descriptions are an improvement on 
an earlier method, ADDS, which used different dimensions and linkages between 
them to describe structures. Thus ASAP is more suited to structures with a 
multi-dimensional array-like backbone. 

Comparison with the ADDS/ ASAP description for a binary tree with linked 
leaves (see IHHIN Ddfl ) . shows that our specification is much more complicated. To 
justify this, we demonstrate that the 2FSA description can resolve dependencies 
more accurately. Consider the ADDS description of the binary tree, 

type Bintree [down] [leaves] { 

Bintree *left, *right is uniquely forward along down; 

Bintree *next is uniquely forward along leaves; 

} 



This description names two dimensions of the structure down and leaves. 
The left and right directions form a binary tree in the down direction, the next 
pointers create a linked list in the leaves dimension. The two dimensions are 
not described as disjoint, since the same node can be reached via different routes 
along the two dimensions. Now consider the following code fragment, where two 
statements write to some subnodes of a pointer ’p’. 

p->l->next->next = ... 

p->r = . . . 

Dependency analysis of these statements will want to discover if the two 
pointers on the left sides can ever point to the same node. Since the sub-directions 
contain a mixture of directions from each dimension [down] and [leaves] , 
ADDS analysis must assume conservatively that there may be a dependence. 
The 2FSA description however can produce a 2FSA that accepts all pathnames 
p for which these two pointers are aliased. This FSA is empty, and thus these 
writes are always independent. 

The ‘Graph Types’ [KS93] are not comparable to ours since they allow 
descriptions to query the type of nodes and allow certain runtime information, such 
as whether a node is a leaf, to be encoded. If we drop these properties from their 
descriptions then we can describe many of their data structures using 2FSAs. 
This conversion can probably not be done automatically. Their work does not 
consider dependency analysis; it is mainly concerned with program correctness 
and verification. |S,T,TK97j extends this to verification of list and tree programs, 
and hints that this analysis could be extended to graph types. 

Shape types use grammars to describe structures, which is more 
powerful than our 2FSA-based approach. Structure linkages are represented by a 
multi-set of field and node names. Each field name is followed by the pair of nodes 
that it links. The structure of the multi-set is given by a context-free grammar. 
In addition, operations on these structures are presented as transformers, rewrite 
rules that update the multi-sets. However, many of the questions we want to ask 
in our dependence analysis may be undecidable in this framework. If we drop 
their convention that the pointers from a leaf node point back to itself, we can 
describe the dependence properties of their example structures (e.g. skip lists of 
level two, binary trees with linked leaves and left-child, right-sibling trees) using 
our formalism. 

In innnBi, the authors use formal language methods (pushdown automata 
and rational transducers) to store dependency information. The pushdown au- 
tomata are used for array structures, the rational transducers for trees; more 
complex structures are not considered. Rational transducers are more general 
than 2FSAs, since they allow ε transitions that are not at the end of paths. This 
extension causes problems when operations such as intersections are considered. 
Indeed only ones which are equivalent to 2FSAs can be fully utilised. Handling 
multidimensional arrays in their formalism will require a more sophisticated lan- 
guage. Our method is not intended for arrays, although they can be described 
using 2FSAs. However, the analysis will be poor because we have to approximate 
whichever direction the code recurses through them. 

allows programs to be annotated with reachability expressions, which 
describe properties about pointer-based data structures. Their approach is 
generally more powerful than ours, but they cannot, for example, express properties 
of the form x.RE1 = y.RE2. Our 2FSAs can represent this sort of dependency. 
Their approach is good for describing structures during phases where their 
pointers are manipulated dynamically, an area our method handles poorly at 
best. Their logic is decidable, but they are restricted in that they do not have a 
practical decision algorithm. 

[Fea98] looks at dependencies in a class of programs using rational transducers 
as the framework. These are similar to, but more general than, the 2FSAs we use 
here. The generality implies that certain manipulations produce problems that 
are undecidable, and a semi-algorithm is proposed as a partial solution. Also, the 
only data structures considered are trees, although extension to doubly linked 
lists and trees with upward pointers is hinted at. However, even programs that 
operate over relatively simple structures like trees can have complex dependency 
patterns. Our method can be used to produce the same information from these 
programs. This indicates that our approach applies more widely than to 
structures with complex linkages alone. 

In [Deu94] the author aims to extract aliasing information directly from the 
program code. A system of symbolic alias pairs (SAPs) is used to store this 
information. Our 2FSAs play a similar role, but we require that they are provided 
separately by the programmer. For some 2FSAs the aliasing information can 
be expressed as a SAP. For example, the information for the n direction in the 
binary tree can be represented in (a slightly simplified version of) SAP notation. 

The angle brackets hold two expressions that are aliased for all values of the 
parameters k1, k2 that satisfy the condition k1 = k2. SAPs can handle a set of 
paths that are not regular, so they are capable of storing information that a finite 
state machine cannot. These expressions are not, however, strictly more powerful 
than 2FSAs. For example, the alias information held in the 2FSA descriptions 
for the rectangular mesh cannot be as accurately represented in SAP form. 

In IL{,SH8I the authors describe a method for transforming a program into 
an equivalent one that keeps track of its own dataflow information. Thus each 
memory location stores the statement that last wrote into it. This immediately 
allows methods that compute alias information to be used to track dataflow 
dependencies. The method stores the statement as a line number in the source 
code, rather than as a unique run time identifier, such as the control words 
we use. This means that dependencies between different procedures, or between 
different calls of the same procedure, will not be computed accurately. Although 
we describe how dataflow information can be derived by our approach, we can 
still produce alias information about pointer accesses. So the techniques of 
could still be applied to produce dependency information. 

6 Conclusions and Future Work 

This paper builds on our earlier work in which the analysis was restricted to 
programs which recursed in only the generator directions. We have now extended the 
analysis to non-generator recursion. The approximation is safe in the sense that 
no dependencies will be missed. The next step would be to extend the analysis 
to code that uses pointer assignment to dynamically update the structure. 

As it stands the computation of the waits-for dependency information in- 
volves a large intermediate 2FSA, which will make this approach intractable for 
longer programs. We are currently working on alternative methods for simplify- 
ing the conflict 2FSA that would avoid this. 

We are also pursuing the idea of attaching probabilities to the links in the 
structure description. A pointer could link a number of different nodes together, 
with each pair being associated with a probability. We could then optimise the 
parallelisation for the more likely pointer configurations while still remaining 
correct for all possible ones. 
Abstract. This paper presents a set of proposals for the OpenMP 
shared-memory programming model oriented towards the definition of 
thread groups in the framework of nested parallelism. The paper also 
describes the additional functionalities required in the runtime library 
supporting the parallel execution. The extensions have been implemented in 
the OpenMP NanosCompiler and evaluated in a set of real applications 
and benchmarks. In this paper we present experimental results for one 
of these applications.¹ 



1 Introduction 

Parallel architectures are becoming affordable and common platforms for the 
development of computing-demanding applications. Users of such architectures 
require simple and powerful programming models to develop and tune their par- 
allel applications with reasonable effort. These programming models are usually 
offered as library implementations or extensions to sequential languages that 
express the available parallelism in the application. Language extensions are de- 
fined by means of directives and language constructs (e.g. OpenMP |H|, which is 
the emerging standard for shared-memory parallel programming) . 

In general, multiple levels of parallelism appear in the majority of numerical 
applications in science and engineering. Although OpenMP accepts the specifica- 
tion of multiple levels of parallelism (through the nesting of parallel constructs), 
most current programmers only exploit a single level of parallelism. This is be- 
cause their applications achieve satisfactory speed-ups when executed in mid-size 
parallel platforms or because most current systems supporting OpenMP (com- 
pilers and associated thread-level layer) sequentialize nested parallel constructs. 
Exploiting a single level of parallelism means that there is a single thread (mas- 
ter) that produces work for other threads (slaves). Once parallelism is activated, 
new opportunities for parallel work creation are ignored by the execution 
environment. Exploiting a single level of parallelism (usually around loops) may 
yield low performance returns when the number of processors running the 
application increases. When multiple levels of parallelism are allowed, new 
opportunities for parallel work creation result in the generation of work for all or 
a restricted number of threads (as specified by the NUM_THREADS clause in v2.0 
of the OpenMP specification). 

¹ The reader is referred to the extended version of this paper (available at 
http://www.ac.upc.es/nanos) for additional details about the rest of the applications. 

In current practice, the specification of these multiple levels of parallelism is 
done through the combination of different programming models and interfaces, 
for example the use of MPI coupled with either OpenMP or High Performance 
Fortran P|. The message passing layer is used to express outer levels of 
parallelism (coarse grain) while the data parallel layer is used to express the 
inner ones (fine grain). The parallelism exploited in each layer is statically 
determined by the programmer, and these scheduling decisions cannot be modified 
dynamically. 

Other proposals offer work queues and an interface for inserting 
application tasks before execution, allowing several task descriptions to be 
active at the same time (e.g. the Illinois-Intel Multithreading library 0). Kuck 
and Associates, Inc. has also made proposals to OpenMP to support multi-level 
parallelism through the WorkQueue mechanism cni, in which work can be cre- 
ated dynamically, even recursively, and put into queues. Within the WorkQueue 
model, nested queuing permits a hierarchy of queues to arise, mirroring recursion 
and linked data structures. These proposals offer multiple levels of parallelism 
but do not support the logical clustering of threads in the multilevel structure, 
which we think is a necessary aspect to allow good work distribution and data 
locality exploitation. The main motivation of this paper is to make a proposal 
for an extension of the OpenMP programming model to define how the user 
might have control over the work distribution when dealing with multiple levels 
of parallelism. 

Scaling to a large number of processors faces another challenging problem: 
the poor interaction between the parallel code and the deep memory 
hierarchies of contemporary shared-memory systems. To surmount this problem, 
some vendors and researchers propose the use of user-directed page migration 
and data layout directives with some other directives and clauses to perform 
the distribution of work following the expressed data distribution [T^ P . Our 
proposal relieves the user from this problem and proposes clauses that allow 
the programmer to control the distribution of work to groups of threads and 
ensure the temporal reuse of data (both across multiple parallel constructs and 
multiple instantiations of the same parallel construct). Although not used in the 
experimental evaluation in this paper, we assume the existence of a smart user- 
level page migration engine 0. This engine, in cooperation with the compiler, 
tries at runtime to accurately and timely fix both poor initial page placement 
schemes (emulating static data distributions in data parallel languages) and in 
some cases, follow the dynamic page placement requirements of the application. 

The paper is organized as follows. Section 2 describes the extensions to the 
OpenMP programming model proposed in this paper. These extensions have 
been included in the experimental NanosCompiler based on Parafrase-2 P]. Sec- 
tion 3 describes the main functionalities of the run-time system that supports 
the proposed extensions. Experimental results are reported in Section 4. Finally, 
some conclusions are presented in Section 5. 

2 Extensions to OpenMP 

This section presents the extensions proposed to the OpenMP programming 
model. They are oriented towards the organization of the threads that are used 
to execute the parallelism in a nest of PARALLEL constructs when exploiting 
multiple levels of parallelism and the allocation of work to them. 

2.1 OpenMP v2.0 

We assume the standard execution model defined by OpenMP. A program 
begins execution as a single process or thread. This thread executes sequentially 
until the first parallel construct is found. At this time, the thread creates a 
team of threads and becomes its master thread. The number of threads in the 
team is controlled by environment variables, the NUM_THREADS clause, and/or 
library calls. A team of threads is therefore defined as the set of threads that 
participate in the execution of any work inside the parallel construct. Threads 
are consecutively numbered from zero to the number of available threads (as 
returned by the intrinsic function omp_get_num_threads) minus one. The master 
thread is always identified with 0. Work-sharing constructs (DO, SECTIONS and 
SINGLE) are provided to distribute work among the threads in a team. 

If a thread in a team executing a parallel region encounters another (nested) 
parallel construct, most current implementations serialize its execution (i.e. a 
new team is created to execute it with only one thread). The actual definition 
assumes that the thread becomes the master of a new team composed of as many 
threads as indicated by environment variables, the NUM_THREADS clause applied 
to the new parallel construct, and/or library calls. The NUM_THREADS clause 
allows a different value for each thread, and therefore makes it possible to 
customize the number of threads used to spawn the new parallelism a thread may 
find. 

In the next subsections we propose a generalization of the execution model 
by allowing the programmer to define a hierarchy of groups of threads which are 
going to be involved in the execution of the multiple levels of parallelism. We also 
propose to export the already existing thread name space and make it visible 
to the programmer. This allows the specification of work allocation schemes 
alternative to the default ones. The proposal to OpenMP mainly consists of two 
additional clauses: GROUPS and ONTO. Moreover, the proposal also extends the 
scope of the synchronization constructs with two clauses: GLOBAL or LOCAL. 

2.2 Definition of Thread Groups 

In our proposal, a group of threads is composed of a number of consecutive 
threads following the active numbering in the current team. In a parallel 
construct, the programmer may define the number of groups and the composition 
of each one. When a thread in the current team encounters a parallel construct 
defining groups, the thread creates a new team and becomes its master thread. 
The new team is composed of as many threads as groups are defined; the rest of 
the threads are reserved to support the execution of nested parallel constructs. 
In other words, the groups definition establishes the threads that are involved 
in the execution of the parallel construct plus an allocation strategy or 
scenario for the inner levels of parallelism that might be spawned. When a 
member of this new team encounters another parallel construct (nested in the one 
that caused the group definition), it creates a new team and deploys its 
parallelism to the threads that compose its group. Groups may overlap and 
therefore a thread may belong to more than one group. 

Figure 1 shows different definitions of groups. The example assumes that 8 
threads are available at the time the corresponding parallel construct is found. In 
the first scenario, 4 groups are defined with a disjoint and uniform distribution 
of threads; each group is composed of 2 threads. In the second scenario, groups 
are disjoint with a non-uniform distribution of the available threads: 4 groups are 
defined with 4, 2, 1 and 1 threads. The third scenario shows a uniform definition 
of overlapped groups. In this case 3 groups are defined, each one with 4 threads; 
therefore, each group shares 2 threads with another group. 

Fig. 1. Examples of group definitions. 

The GROUPS Clause. This clause allows the user to specify any of the afore- 
mentioned group definitions. It can only appear in a PARALLEL construct or 
combined PARALLEL DO and PARALLEL SECTIONS constructs. 

C$OMP PARALLEL [DO | SECTIONS] [GROUPS(gspec)] 

The most general format for the groups specifier gspec allows the specification 
of all the parameters in the group definition: the number of groups, the identifiers 
of the threads that participate in the execution of the parallel region, and the 
number of threads composing each group: 

GROUPS (ngroups , masters, howmany) 

The first argument (ngroups) specifies the number of groups to be defined and 
consequently the number of threads in the team that is going to execute the 
parallel construct. The second argument (masters) is an integer vector with the 
identifiers (using the active numbering in the current team) of the threads that 
will compose the new team. Finally, the third argument (howmany) is an integer 
vector whose elements indicate the number of threads that will compose each 
group. The vectors have to be allocated in the memory space of the application 
and their content and correctness have to be guaranteed by the programmer. 
This format for the GROUPS clause allows the specification of all the scenarios 
shown in Figure 1. For example, the second scenario in this figure could be 
defined with: 
ngroups = 4 

masters [4] = {0, 4, 6, 7} 
howmany [4] = {4, 2, 1, 1} 
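
To make the meaning of the two vectors concrete, the following Python sketch 
expands a (masters, howmany) pair into the thread identifiers of each group. It 
is only an illustration: the function name expand_groups is hypothetical, and 
the masters vector used for the third (overlapped) scenario is an assumed 
encoding, since the text gives explicit values only for the second scenario. 

```python
def expand_groups(masters, howmany):
    """Return, for each group, the list of thread identifiers it contains.

    Assumption: group g is the run of howmany[g] consecutive identifiers
    that starts at masters[g], as the definition of groups suggests.
    """
    if len(masters) != len(howmany):
        raise ValueError("masters and howmany must have the same length")
    return [list(range(m, m + n)) for m, n in zip(masters, howmany)]


# Second scenario of Figure 1 (vectors taken from the text).
scenario2 = expand_groups([0, 4, 6, 7], [4, 2, 1, 1])

# Third scenario: 3 overlapped groups of 4 threads each.
# The masters vector [0, 2, 4] is an assumed encoding, not given in the text.
scenario3 = expand_groups([0, 2, 4], [4, 4, 4])

print(scenario2)
print(scenario3)
```

With the assumed encoding, consecutive groups of the third scenario have two 
identifiers in common, which agrees with the description of the figure. 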

Our experience in the use of thread groups has shown that, in a large number 
of real applications, the definition of non-overlapping groups is sufficient. This 
observation motivated the following alternative format for the groups specifier 
gspec: 

GROUPS (ngroups , weight) 

In this case, the user specifies the number of groups (ngroups) and an integer 
vector (weight) indicating the relative weight of the computation that each 
group has to perform. From this information and the number of threads available 
in the team, the runtime is in charge of computing the two previous vectors 
(masters and howmany). This eases the use of groups because the programmer 
is relieved from the task of determining their exact composition. Vector weight is 
allocated by the user in the application address space and it has to be computed 
from information available within the application itself (for instance iteration 
space, computational complexity or even information collected at runtime). 

In our current implementation, the composition of the groups is computed 
using a predefined algorithm. The algorithm assigns all the available threads to 
the groups and ensures that each group receives at least one thread. The main 
body of the algorithm is as follows (using Fortran 90 syntax): 

      howmany(1:ngroups) = 1 
      do while (sum(howmany(1:ngroups)) .lt. nthreads) 
         pos = maxloc(weight(1:ngroups) / howmany(1:ngroups)) 
         howmany(pos(1)) = howmany(pos(1)) + 1 
      end do 
      masters(1) = 0 
      do i = 1, ngroups-1 
         masters(i+1) = masters(i) + howmany(i) 
      end do 

In this algorithm, nthreads is the number of threads that are available to spawn 
the parallelism in the parallel construct containing the group definition. Notice 
that the last scenario in Figure 1 cannot be expressed using this format because 
it requires overlapping groups. 
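
As an illustration, the algorithm above can be transcribed into Python as 
follows. This is a sketch under stated assumptions rather than the NthLib 
implementation: the function name compute_groups is hypothetical, the quotient 
weight/howmany is evaluated exactly (as a rational number) instead of with 
integer division, and ties are resolved in favor of the first group, as maxloc 
does. 

```python
from fractions import Fraction


def compute_groups(weight, nthreads):
    """Distribute nthreads among len(weight) non-overlapping groups.

    Returns (masters, howmany), both numbered from thread 0.
    Assumption: weights are positive integers and the ratio is exact.
    """
    ngroups = len(weight)
    if nthreads < ngroups:
        raise ValueError("need at least one thread per group")
    howmany = [1] * ngroups  # every group receives at least one thread
    while sum(howmany) < nthreads:
        # maxloc: first group with the largest weight per assigned thread
        pos = max(range(ngroups),
                  key=lambda g: Fraction(weight[g], howmany[g]))
        howmany[pos] += 1
    masters = [0] * ngroups
    for g in range(1, ngroups):
        masters[g] = masters[g - 1] + howmany[g - 1]
    return masters, howmany


# Weights 4:2:1:1 on 8 threads give the second scenario of Figure 1.
print(compute_groups([4, 2, 1, 1], 8))
# Nine groups with relative weights 96,48,48,1,... on 32 threads.
print(compute_groups([96, 48, 48, 1, 1, 1, 1, 1, 1], 32))
```

The second call yields 13, 7 and 6 threads for the three heavy groups and one 
thread for each of the remaining six. That is what this sketch computes; it is 
not a transcription of any figure in the paper. 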

A second shortcut is used to specify uniform non-overlapping groups, i.e. when 
all the groups have similar computational requirements: 

GROUPS (ngroups) 

The argument ngroups specifies the number of groups to be defined. This format 
assumes that work is well balanced among groups and therefore all of them 
require the same number of threads to exploit inner levels of parallelism. The 
runtime is in charge of computing the composition of each group by simply 
equidistributing the available threads among the groups. Notice that this format 
only allows the specification of the first scenario in Figure 1. 
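
A minimal Python sketch of such an equidistribution is given below. The 
function name uniform_groups is hypothetical, and the handling of a remainder 
(extra threads go to the first groups) is an assumption, since the text does 
not say what happens when the number of threads is not a multiple of ngroups. 

```python
def uniform_groups(ngroups, nthreads):
    """Split nthreads into ngroups consecutive, non-overlapping groups."""
    base, extra = divmod(nthreads, ngroups)
    if base == 0:
        raise ValueError("need at least one thread per group")
    # Assumption: the first `extra` groups receive one additional thread.
    howmany = [base + (1 if g < extra else 0) for g in range(ngroups)]
    masters = [sum(howmany[:g]) for g in range(ngroups)]
    return masters, howmany


# First scenario of Figure 1: 4 groups on 8 threads.
print(uniform_groups(4, 8))
```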

Finally, the last format allows the user to specify a scenario in which the 
number of non-overlapping groups is constant and can be specified in the source 
code of the application. The programmer also specifies the relative weight of 
each group. The format is as follows: 

GROUPS (gdef [,gdef] ) 

where each gdef has the following format: 
gdef = [name] :nthreads 

The number of groups to be defined is established by the number of gdef fields 
in the directive. Each group definer includes an optional name (which can be 
used to identify the groups at the application level, see next subsection) and an 
expression nthreads specifying the weight of the group. The runtime applies the 
same algorithm to compute the actual composition of each group. 
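
For illustration only, the sketch below splits a gdef list written as a plain 
string into (name, weight) pairs. In the proposal this work is done by the 
compiler on the directive itself; the function name parse_gdefs is 
hypothetical, and weights are restricted here to integer literals although the 
clause accepts an expression. 

```python
def parse_gdefs(spec):
    """Parse 'name:weight,name:weight,...' into a list of (name, weight).

    The name is optional, so ':3' yields (None, 3).
    """
    groups = []
    for item in spec.split(","):
        name, sep, weight = item.partition(":")
        if not sep:
            raise ValueError("each gdef must have the form [name]:nthreads")
        groups.append((name.strip() or None, int(weight)))
    return groups


gdefs = parse_gdefs("temp:2,press:3,vel:1,accel:2")
print(len(gdefs), gdefs)
```

The number of pairs gives the number of groups, and the weights can then be 
passed to the same distribution algorithm used for GROUPS(ngroups, weight). 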

2.3 Work Allocation 

OpenMP establishes a thread name space that numbers the threads composing 
a team from 0 to the total number of threads in the team minus one. The master 
of the team has number 0. Our proposal consists of exporting this thread name 
space to the programmer, who can then establish a particular mapping 
between the work divided by the OpenMP work-sharing constructs and the 
threads in the group. This is useful to improve data locality along the execution 
of multiple parallel constructs or instantiations of the same parallel construct. 



The ONTO Clause. The work-sharing constructs in OpenMP are DO, SECTIONS 
and SINGLE. The ONTO clause enables the programmer to change the default 
assignment of the chunks of iterations originated from a SCHEDULE clause in 
a DO work-sharing construct or the default lexicographical assignment of code 
sections in a SECTIONS work-sharing construct. 

The syntax of the ONTO clause applied to a DO work-sharing construct is as 
follows: 

C$OMP DO [ONTO(target)] 

The argument target is an expression that specifies which thread in the current 
team will execute each particular chunk of iterations. If the expression contains 
the loop control variable, then the chunk number (numbered starting at 0) is 
used to determine which thread in the team has to execute it. The loop control 
variable is substituted by the chunk number and the resulting expression is 
then taken modulo the number of threads in the team. 

For instance, assume a parallel DO annotated with an ONTO(2*i) clause, i 
being the loop control variable. This clause defines that only those members 
of the team with even identifiers will execute the chunks of iterations coming 
from the loop. If i is a loop invariant variable, then the thread with identifier 
mod(2*i, nthreads), nthreads being the number of threads in the team, will 
execute all the chunks originated for the parallel execution of the loop. If not 
specified, the default clause is ONTO(i), i being the loop control variable of the 
parallel loop. 
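
The mapping rule can be sketched in a few lines of Python. The helper 
onto_targets is hypothetical; it takes the ONTO expression as a function of the 
chunk number and applies the modulo described above. 

```python
def onto_targets(target, nchunks, nthreads):
    """Thread that executes each chunk: target(chunk) mod nthreads."""
    return [target(chunk) % nthreads for chunk in range(nchunks)]


# ONTO(2*i) on a team of 8 threads: only even identifiers receive chunks.
even_only = onto_targets(lambda i: 2 * i, nchunks=8, nthreads=8)

# Default ONTO(i) on a team of 4 threads: chunks are dealt out cyclically.
default = onto_targets(lambda i: i, nchunks=8, nthreads=4)

print(even_only)
print(default)
```

Note that the "even identifiers only" behavior of ONTO(2*i) relies on the team 
size being even; with an odd number of threads the modulo also produces odd 
identifiers. 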

For the SECTIONS work-sharing construct, the ONTO (target) clause is at- 
tached to each SECTION directive. Each expression target is used to compute 
the thread that will execute the statements parceled out by the corresponding 
SECTION directive. If the ONTO clause is not specified the compiler will assume 
an assignment following the lexicographical order of the sections. 

For the SINGLE work-sharing construct, the ONTO clause overrides the dy- 
namic nature of the directive, thus specifying the thread that has to execute the 
corresponding code. If not specified, the first thread reaching the work-sharing 
construct executes the code. 

2.4 Thread Synchronization 

The definition of groups affects the behavior of implicit synchronizations at the 
end of parallel and work-sharing constructs as well as synchronization constructs 
(MASTER, CRITICAL and BARRIER). Implicit synchronizations only involve those 
threads that belong to the team participating in the execution of the enclosing 
parallel construct. However, the programmer may be interested in forcing 
a synchronization which affects all the threads in the application or just 
the threads that compose the group. In order to differentiate both situations, 
clauses GLOBAL and LOCAL are provided. By default, LOCAL is assumed. 

2.5 Intrinsic Functions 

One new intrinsic function has been added to the OpenMP API and two existing 
ones have been redefined: 

- Function omp_get_num_threads returns the number of threads that compose 
the current team (i.e. the number of groups in the scope of the current 
PARALLEL construct). 

- Function omp_get_threads returns the number of threads that compose the group 
the invoking thread belongs to (and that are available to execute any nested 
PARALLEL construct). 

- Function omp_get_thread_num returns the thread identifier within the group 
(between 0 and omp_get_num_threads-1). 

3 Example: MBLOCK Kernel 

In this section we illustrate the use of the proposed directives with MBLOCK, 
a generic multi-block kernel. For additional examples (which include some other 
kernels and SPEC95 benchmarks), the reader is referred to the extended version 
of this paper. 

3.1 MBLOCK Kernel 

The MBLOCK kernel simulates the propagation of a constant source of heat in 
an object. The propagation is computed using the Laplace equation. The output 
of the benchmark is the temperature at each point of the object. The object 
is modeled as a multi-block structure composed of a number of rectangular 
blocks. Blocks may have different sizes (defined by vectors nx, ny and nz). They 
are connected through a set of links at specific positions. Blocks are stored 
consecutively in a vector named a. The connections between blocks are stored 
in a vector named link in which each element contains two indices to vector a. 

The algorithm has two different phases. In the first one, all data is read (sizes 
of the blocks and connection points). All points are assumed to be at 0 degrees 
when the simulation starts, except for a single point (ihot) that has a fixed 
(forced) temperature over the whole simulation time. The second phase is the 
solver, which consists of an iterative time-step loop that computes the 
temperature at each point of the object. The initialization is done exploiting 
two levels of parallelism. The outer level exploits the parallelism at the level of 
independent blocks (loop iblock). The inner level of parallelism is exploited in 
the initialization of the elements that compose each block. The same multilevel 
structure appears in the solver (computation for each independent block and 
computation within each block). The interaction between the points that belong 
to two blocks is done exploiting a single level of parallelism. 

Figure 2 shows the skeleton for an OpenMP parallel version of the kernel. In 
this version, groups of threads are defined using a vector that stores the size of 
each block. This vector is used to compute, at runtime, the actual composition of 
each group according to the total number of processors available. For instance, 
Figure 3a shows the composition of the groups when 32 processors are available 
and 9 independent blocks are defined in the application. Notice that the 
algorithm currently implemented assumes that at least one processor is assigned 
to each group. The rest of the processors are assigned according to the weight of 
each group. 

Figure 3b shows a different allocation which assumes that the largest block 
is efficiently executed exploiting a single level of parallelism (i.e. devoting all 
the processors available to exploit the parallelism within each block). However, 
the rest of the blocks are not able to efficiently exploit this large number of 
processors. In this case, computing several blocks in parallel and devoting a 
reduced number of processors to compute each block leads to a more efficient 
parallel execution. Finally, several small blocks can also be executed sharing 
a single processor in order to increase the global utilization. This definition of 
groups requires the use of the general specifier for the GROUPS clause and the 
definition of a function that computes vectors masters and howmany. In the 
example, the vectors should be initialized as follows: 

ngroups = 9 

masters [9] = {0, 0, 15, 30, 30, 30, 31, 31, 31} 
howmany[9] = {32, 15, 15, 1, 1, 1, 1, 1, 1} 









Notice that the first group overlaps with all the other groups and that the last 
6 groups also overlap: three of them share processor 30 and the other three 
share processor 31. 
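
The overlap in this definition can be checked with a short Python sketch that 
counts how many groups include each processor. The helper 
groups_per_processor is hypothetical; the two vectors are the ones given in 
the text. 

```python
from collections import Counter

masters = [0, 0, 15, 30, 30, 30, 31, 31, 31]
howmany = [32, 15, 15, 1, 1, 1, 1, 1, 1]


def groups_per_processor(masters, howmany):
    """Count, for every processor identifier, the groups that contain it."""
    load = Counter()
    for first, count in zip(masters, howmany):
        for proc in range(first, first + count):
            load[proc] += 1
    return load


load = groups_per_processor(masters, howmany)
for proc in (0, 14, 15, 29, 30, 31):
    print(proc, load[proc])
```

Processors 0 to 29 each belong to two groups (the largest block plus one of 
the two medium blocks), while processors 30 and 31 each belong to four groups 
(the largest block plus three small blocks). 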

4 The Run-Time Support: NthLib 

The execution model defined in Section 2 requires some functionalities not 
available in most runtime libraries supporting parallelizing compilers. These 
functionalities are provided by our threads library NthLib. 

4.1 Work Descriptors and Stacks 

The most important aspect to consider is the support for spawning nested 
parallelism. Multi-level parallelism enables the generation of work from 
different simultaneously executing threads. In this case, new opportunities for 
parallel work creation result in the generation of work for all or a restricted 
set of threads. NthLib provides different mechanisms to spawn parallelism, 
depending on the hierarchy level in which the application is running. When 
spawning the deepest (fine grain) level of parallelism, a mechanism based on 
work descriptors is available to supply the work to all the threads 
participating in the parallel region. The mechanism is implemented as 
efficiently as the ones available in current thread packages. Although 
efficient, this mechanism does not allow the exploitation of further levels of 
parallelism. This requires the generation of work using a more costly interface 
that provides work descriptors with a stack. Owning a stack is necessary for the 
higher levels of parallelism to spawn an inner level; the stack is used to 
maintain the context of the higher levels, along with the structures needed to 
synchronize the parallel execution, while executing the inner levels. 

      PROGRAM MBLOCK 

C     Initialize work and location 
      do iblock=1,nblock 
         work(iblock) = nx(iblock)*ny(iblock)*nz(iblock) 
      enddo 

C     Solve within a block 
 10   tres=0.0 
      a(ihot)=100000.0 
C$OMP PARALLEL DO SCHEDULE(STATIC) GROUPS(nblock,work) 
C$OMP& REDUCTION(+:tres) PRIVATE(res) 
      do iblock=1,nblock 
         call solve(a(loc(iblock)),nx(iblock),ny(iblock),nz(iblock),res,tol) 
         tres=tres+res 
      enddo 
C$OMP END PARALLEL DO 

C     Perform inter block interactions 
C$OMP PARALLEL DO PRIVATE(val) 
      do i=1,nconnect 
         val=(a(link(i,1))+a(link(i,2)))/2.0 
         a(link(i,1))=val 
         a(link(i,2))=val 
      enddo 
C$OMP END PARALLEL DO 

      if (tres.gt.tol) goto 10 

      subroutine solve(t,nx,ny,nz,tres,tol) 

      res=0.0 
C$OMP PARALLEL DO SCHEDULE(STATIC) REDUCTION(+:res) 
C$OMP& PRIVATE(i,j,k) 
      do k=1,nz 
         do j=1,ny 
            do i=1,nx 
               t(i,j,k)=(told(i,j,k-1)+told(i,j,k+1)+ 
     +                   told(i,j-1,k)+told(i,j+1,k)+ 
     +                   told(i-1,j,k)+told(i+1,j,k)+told(i,j,k)*6.0)/12.0 
               res=res+(t(i,j,k)-told(i,j,k))**2 
            enddo 
         enddo 
      enddo 
C$OMP END PARALLEL DO 
      end 

Fig. 2. Source code for mblock benchmark. 

Fig. 3. Groups defined assuming 32 processors and relative weights for the 9 
blocks in the multi-block structure {96,48,48,1,1,1,1,1,1}. 

In both cases, kernel threads are assumed to be the execution vehicles that get 
work from a set of queues where either the user or the compiler has decided to 
issue these descriptors. Each kernel thread is assumed to have its own queue 
from where it extracts threads to execute. These queues are identified with a 
global identifier between 0 and the number of available processors minus 1. 

4.2 Thread Identification 

The OpenMP extensions proposed in this paper 1) define a mechanism to control 
the thread distribution among the different levels of parallelism; and 2) export 
the thread name space to the application in order to specify the work distribution 
among threads. The runtime library is in charge of mapping an identifier in the 
thread name space (identifier between 0 and nteam-1, nteam being the number 
of threads in the team the thread belongs to) to one of the above mentioned 
queues. 

The descriptor of a thread contains the context that identifies it within the 
group of threads it belongs to. The following information is available: 

1. nteam: indicates the number of threads in the team. 

2. rel_id: thread identifier relative to the team the thread belongs to: 0 to 
nteam-1. 

3. abs_id: thread identifier absolute to the number of processors available: 0 to 
omp_get_max_threads-1. 

4. master: indicates the absolute identifier of the master of the team. 

4.3 Groups 

The descriptor also contains information about the number of threads available 
for nested parallel constructs and the definition of groups for them. 

5. nthreads: indicates the number of threads that the executing thread has 
available to spawn nested parallelism. 

6. ngroups: indicates the number of groups that will be created for the execu- 
tion of subsequent parallel regions. 

7. where: integer vector with the identifiers of the threads that will behave as 
the master of each new group. 

8. howmany: integer vector specifying the number of threads available within 
each defined group. 

The last three fields are computed when the GROUPS clause is found, through 
the execution of a library function nthf_compute_groups by the thread that is 
defining the groups (master thread). 

Figure 4 shows an example of how these fields in the thread descriptor are 
updated during execution. The upper left part of the figure shows the source 
code with a group definition. The lower left part shows the code emitted 
by the compiler with the appropriate run-time calls to spawn the parallelism 
expressed by the OpenMP directives. This code consists of a call to routine 
nthf_compute_groups and a loop where the run-time routine for thread creation 
is invoked (nthf_create). Notice that the loop will execute as many iterations 
as the number of groups that have to be created. In our example 4 groups will 
be created. The upper right part of the same figure shows the descriptor 
for the thread that finds the PARALLEL construct. It is assumed that the thread 
was already executing in a parallel region where another group definition was 
performed defining two groups. At that point, 16 threads were available and 8 
threads were assigned to it. Its identifier as master thread of the group is 8 
(absolute_id) and it has 8 threads reserved to spawn nested parallelism 
(nthreads = 8). Once function nthf_compute_groups is called by the master 
thread, its thread descriptor is updated with the appropriate information. The 
number of groups that are going to be created is specified by the ngroups field. 
The thread grouping described by the GROUPS clause is used to fill the where and 
howmany vectors. The absolute thread identifiers in the where vector show the 
thread reservation for inner levels of parallelism. Those identifiers have been 
computed by adding the thread reservation for each group to the absolute 
identifier of the master. When the master thread creates a slave thread, it 
creates the descriptor and initializes the fields rel_id, absolute_id, nthreads 
and nteam. The value for rel_id is obtained from the variable nth_p in the 
compiler emitted code. The variable appears as a parameter of the call to the 
thread creation routine nthf_create. The field absolute_id is filled by copying 
the value from the corresponding cell in the where vector. The values for the 
fields nthreads and nteam reflect the fact that the thread belongs to a group. 
Notice that the numbers in vector howmany are copied to the field nthreads in 
the slave descriptors. As a result, when those slaves reach a parallel 
construct, they will be able to spawn further parallelism only on a limited 
number of threads. The field nteam is set to 4 as indicated by field ngroups in 
the master thread. 
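
The descriptor updates described above can be mimicked with the following 
Python sketch. It is not the NthLib code: the class Descriptor and the 
functions compute_groups and create_slaves are hypothetical stand-ins for the 
thread descriptor, nthf_compute_groups and the loop of nthf_create calls, and 
only the non-overlapping case is modeled. 

```python
from dataclasses import dataclass, field


@dataclass
class Descriptor:
    rel_id: int
    absolute_id: int
    nteam: int
    nthreads: int
    master: int
    ngroups: int = 0
    where: list = field(default_factory=list)
    howmany: list = field(default_factory=list)


def compute_groups(desc, howmany):
    """Fill ngroups, where and howmany in the master's descriptor."""
    if sum(howmany) > desc.nthreads:
        raise ValueError("groups need more threads than are reserved")
    desc.ngroups = len(howmany)
    desc.howmany = list(howmany)
    desc.where = []
    offset = desc.absolute_id  # reservations start at the master itself
    for count in howmany:
        desc.where.append(offset)
        offset += count


def create_slaves(desc):
    """Build one descriptor per group, as the creation loop would."""
    return [Descriptor(rel_id=nth_p,
                       absolute_id=desc.where[nth_p],
                       nteam=desc.ngroups,
                       nthreads=desc.howmany[nth_p],
                       master=desc.absolute_id)
            for nth_p in range(desc.ngroups)]


# Thread with absolute identifier 8 and 8 reserved threads defines 4 groups.
master = Descriptor(rel_id=1, absolute_id=8, nteam=2, nthreads=8, master=0)
compute_groups(master, [2, 3, 1, 2])
slaves = create_slaves(master)
print(master.where)
print([(s.rel_id, s.absolute_id, s.nthreads) for s in slaves])
```

With the values of the example, the where vector becomes 8, 10, 13, 14 and the 
slaves with relative identifiers 1, 2 and 3 receive 3, 1 and 2 threads, in 
agreement with the descriptors shown in Figure 4. 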

For each GROUPS syntax described in Section 2.2, a run-time call is 
invoked to take the appropriate actions. The case where the programmer 
supplies just the number of groups to be created and the vector indicating how 
many threads have to be reserved for each group is treated in the same manner as 
in the previous example. The vector where is empty, and the run-time computes 
the threads to be involved in the execution of the parallel region, following the 
algorithm described in Section 2.2. 



Source code: 

C$OMP PARALLEL GROUPS(temp:2,press:3,vel:1,accel:2) 
C$OMP DO SCHEDULE(STATIC) 
      do i = 1, 1000 
      enddo 
C$OMP DO SCHEDULE(STATIC) 
      do i = 1, 1000 
      enddo 
C$OMP END PARALLEL 

Code emitted by the compiler: 

      call nthf_compute_groups(4,2,3,1,2) 
      do nth_p = 0,3 
         call nthf_create(...,nth_p,...) 
      enddo 

Thread descriptors ((1) initial thread after the outer group definition; (2) and 
(3) members of the outer team, (3) being the thread that finds the PARALLEL 
construct; (4) to (6) slaves created by (3)): 

Field         (1)         (2)         (3)            (4)         (5)         (6) 
rel_id        0           0           1              1           2           3 
absolute_id   0           0           8              10          13          14 
nteam         1           2           2              4           4           4 
nthreads      16          8           8              3           1           2 
master        0           0           0              8           8           8 
lowmember     0           0           8              10          13          14 
ngroups       2           ?           4              ?           ?           ? 
where         |0|8|?|?|   |?|?|?|?|   |8|10|13|14|   | | | | |   | | | | |   | | | | | 
howmany       |8|8|?|?|   |?|?|?|?|   |2|3|1|2|      | | | | |   | | | | |   | | | | | 

Fig. 4. Run-time example of group definition. 

The BARRIER construct or the implicit synchronization at the end of each 
work-sharing construct implies a synchronization point for the members of the 
group and not for the global execution of the application. In any case, our 
proposal establishes the possibility to define a BARRIER with a local or global 
behavior (GLOBAL clause). In order to have a local or global behavior for the 
synchronization points, the run-time library has to know the identifiers of the 
threads involved in the barrier. The run-time is able to determine which threads 
compose a team because the relative thread identifiers in a team are always in 
the range 0 to nteam-1. Using the translation mechanism, it is possible 
to determine the absolute thread identifiers of the threads in a team. 
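
One way to picture this translation is the Python sketch below. It assumes, 
for illustration only, that the relative-to-absolute mapping of a team is the 
where vector kept in its master's descriptor and that a GLOBAL barrier involves 
every thread of the application; the function barrier_participants is 
hypothetical. 

```python
def barrier_participants(where, total_threads, scope="LOCAL"):
    """Absolute identifiers of the threads that take part in a barrier."""
    if scope == "GLOBAL":
        return list(range(total_threads))
    if scope == "LOCAL":
        # Relative identifiers 0..nteam-1 index the (assumed) where vector.
        return [where[rel_id] for rel_id in range(len(where))]
    raise ValueError("scope must be 'LOCAL' or 'GLOBAL'")


team_where = [8, 10, 13, 14]  # team of 4 threads from the Figure 4 example
local_ids = barrier_participants(team_where, total_threads=16)
global_ids = barrier_participants(team_where, 16, scope="GLOBAL")
print(local_ids)
print(len(global_ids))
```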



5 Experimental Evaluation 

In this section we evaluate the behavior of the MBLOCK kernel on a Silicon 
Graphics Origin2000 system with 64 R10k processors, running at 250 MHz 
with 4 MB of secondary cache each. For all compilations we use the native f77 
compiler (flags -64 -Ofast=ip27 -LNO:prefetch_ahead=1:auto_dist=on). For 
an extensive evaluation, please refer to the extended version of this paper. 

In this section we compare the performance of three OpenMP compliant 
parallelization strategies and a parallel version which uses the extensions for 
groups proposed in this paper. The OmpOuter version corresponds to a single 
level parallelization which exploits the existing inter-block parallelism (i.e. 
the blocks are computed in parallel and the computation inside each block is 
sequentialized). The OmpInner version corresponds to a single level 
parallelization in which the intra-block parallelism is exploited (i.e. the 
execution of the blocks is serialized). The Omp2Levels version exploits the two 
levels of parallelism and the GroupsOmp2Levels version uses the clauses proposed 
in this paper to define groups. The program is executed with a synthetic input 
composed of 8 blocks: two with 128x128x128 elements each and the rest with 
64x64x64 elements. Notice that each of the two large blocks is 8 times the size 
of a small one. 

Figure 5 shows the speed-up of the four parallel versions with respect to the 
original sequential version. Performance figures for the OmpOuter version have 
been obtained for 1 and 8 processors. Notice that with 8 processors this version 
achieves a speed-up close to 3 due to the imbalance that appears between the 
large and small blocks. This version does not benefit from using more processors 
than the number of blocks in the computation. The speed-up for the OmpInner 
version is reasonable up to 32 processors. The efficiency of the parallelization is 
considerably reduced when more processors are used to execute the loops in the 
solver (due to insufficient work granularity). The Omp2Levels version reports 
some performance improvement but suffers from the same problem: each loop is 
executed with all the processors. The GroupsOmp2Levels version performs the best 
when more than 8 processors are used. In this version, the work in the inner 
level of parallelism is distributed following the groups specification. 
Therefore, 8 processors are devoted to exploit the outer level of parallelism, 
and all the processors are distributed among the groups following the 
proportions dictated by the work vector in Figure 2. As a result, the inner 
loops in the GroupsOmp2Levels version are executed with a number of processors 
that allows performance to keep improving as the number of available threads 
increases. 

Fig. 5. Speed up for four different parallelization strategies of MBLOCK 
(legend: OmpOuter, OmpInner, Omp2Levels, GroupsOmp2Levels). 
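
The speed-up close to 3 reported for OmpOuter is consistent with a simple load 
balance estimate, sketched below in Python. The estimate assumes that work is 
proportional to the number of elements in a block and that each block runs on 
its own processor; it ignores memory and synchronization effects, and the 
function outer_level_speedup_bound is hypothetical. 

```python
def outer_level_speedup_bound(block_work, nprocs):
    """Ideal speed-up when each block is computed sequentially in parallel.

    The parallel time is bounded by the largest block, so the speed-up
    cannot exceed total work divided by the largest block's work.
    """
    if nprocs < len(block_work):
        raise ValueError("this sketch assumes one processor per block")
    return sum(block_work) / max(block_work)


# Synthetic input: two 128x128x128 blocks and six 64x64x64 blocks.
blocks = [128 ** 3] * 2 + [64 ** 3] * 6
bound = outer_level_speedup_bound(blocks, nprocs=8)
print(bound)
```

In units of one small block, the total work is 2*8 + 6*1 = 22 and the largest 
block accounts for 8, which gives a bound of 22/8 = 2.75. 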

For instance, notice that when 8 processors are available, the default distri- 
bution of threads performed by the runtime when the GROUPS clause is executed 
leads to a high load unbalance (see Figure 0a). The extensions proposed al- 
low the user to specify some group overlapping which results in a better load 
balancing (thus reducing the execution time from 32.96 to 15.69 seconds). 

6 Conclusions 

In this paper we have presented a set of extensions to the OpenMP programming 
model oriented towards the specification of thread groups in the context 
of multilevel parallelism exploitation. Although the majority of the current 
systems only support the exploitation of single-level parallelism around loops, 
we believe that multi-level parallelism will play an important role in future 
systems. In order to exploit multiple levels of parallelism, several programming 
models can be combined (e.g. message passing and OpenMP). We believe that a 
single programming paradigm should be used and should provide similar 
performance. The extensions have been implemented in the NanosCompiler and 
runtime library NthLib. We have analyzed the performance of some applications on 
an Origin2000 platform. The results show that in these applications, and when 
the number of processors is high, exploiting multiple levels of parallelism with 
thread groups results in better work distribution strategies and thus higher 
speed-ups than both the single level version and the multilevel version without 
groups. For instance, in a generic multi-block MBLOCK code, the performance is 
improved by a factor in the range between 1.5 and 2 (using a synthetic input 
with a small number of very unbalanced blocks). The speed-up would be higher 
when the number of blocks is increased. For other benchmarks the performance is 
improved by a similar factor. 

Abstract. Processing and analyzing large volumes of data plays an
increasingly important role in many domains of scientific research. We are
developing a compiler which processes data intensive applications written
in a dialect of Java and compiles them for efficient execution on clusters
of workstations or distributed memory machines.

In this paper, we focus on data intensive applications with two important
properties: 1) data elements have spatial coordinates associated with
them and the distribution of the data is not regular with respect to these
coordinates, and 2) the application processes only a subset of the
available data on the basis of spatial coordinates. These applications arise
in many domains like satellite data-processing and medical imaging. We
present a general compilation and execution strategy for this class of
applications which achieves high locality in disk accesses. We then present
a technique for hoisting conditionals which further improves efficiency in
execution of such compiled codes.

Our preliminary experimental results show that the performance from 
our proposed execution strategy is nearly two orders of magnitude better 
than a naive strategy. Further, up to 30% improvement in performance 
is observed by applying the technique for hoisting conditionals. 



1 Introduction 

Analysis and processing of very large multi-dimensional scientific datasets
(i.e. where data items are associated with points in a multidimensional
attribute space) is an important component of science and engineering. An
increasing number of applications make use of very large multidimensional
datasets. Examples of such datasets include raw and processed sensor data from
satellites, output from hydrodynamics and chemical transport simulations, and
archives of medical images.

We are developing a compiler which processes data intensive applications
written in a dialect of Java and compiles them for efficient execution on
clusters of workstations or distributed memory machines. Our chosen dialect
of Java includes data-parallel extensions for specifying collections of
objects, a parallel for loop, and reduction variables. We extract a set of
functions (including subscript functions used for accessing left-hand-side and
right-hand-side object collections, aggregation functions, and range
functions) from the given data intensive loop by using the technique of
interprocedural program slicing. Data partitioning, retrieval, and processing
are performed by utilizing an existing runtime system called Active Data
Repository.

Data intensive computations from a number of domains share two important
characteristics. First, the input data elements have spatial coordinates
associated with them. For example, the pixels in the satellite data processing
application have latitude and longitude values associated with them. The
pixels in a multi-resolution virtual microscope image have the x and y
coordinates of the image associated with them. Moreover, the actual layout of
the data is not
regular in terms of the spatial coordinates. Second, the application processes 
only a subset of the available data, on the basis of spatial coordinates. For 
example, in the satellite data processing application, only the pixels within a 
bounding box specified by latitudes and longitudes may be processed. In the 
virtual microscope, again, only the pixels within a rectangular region may be 
processed. 

In this paper, we present a compilation and execution strategy for this class of 
data intensive applications. In our execution strategy, the right-hand-side data 
is read one disk block at a time. Once this data is brought into the memory, 
corresponding iterations are performed. A compiler determined mapping from 
right-hand-side elements to iteration number and left-hand-side element is used 
for this purpose. The resulting code has a very high locality in disk accesses, 
but also has extra computation and evaluation of conditionals in every iteration. 
We present an analysis framework for code motion of conditionals which further 
improves efficiency in execution of such codes. 

The rest of the paper is organized as follows. The dialect of Java we target
and an example of a data intensive application with spatial coordinates are
presented in Section 2. The basic compiler technique and the loop execution
model are presented in Section 3. The technique for code motion of
conditionals is presented in Section 4. Experimental results are presented in
Section 5. We compare our work with related research efforts in Section 6 and
conclude in Section 7.

2 A Data Intensive Application with Spatial Coordinates 

2.1 Data-Parallel Constructs 

We borrow two concepts from object-oriented parallel systems like Titanium,
HPC++, or Concurrent Aggregates [23].

— Domains and Rectdomains are collections of objects of the same type.
Rectdomains have a stricter definition, in the sense that each object belonging
to such a collection has a coordinate associated with it that belongs to a
pre-specified rectilinear section.

— The foreach loop, which iterates over objects in a domain or rectdomain, and 
has the property that the order of iterations does not influence the result of 
the associated computations. 

We introduce a Java interface called Reducinterface. Any object of any class
implementing this interface acts as a reduction variable. A reduction variable
has the property that it can only be updated inside a foreach loop by a series of 
operations that are associative and commutative. Furthermore, the intermediate 
value of the reduction variable may not be used within the loop, except for self- 
updates. 
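
The order-independence requirement can be illustrated with a small Python
sketch. The class `MaxReduction` and its `accumulate` method are hypothetical
stand-ins for a class implementing Reducinterface, not part of the dialect
described here. The update is a maximum, which is associative and commutative,
so any iteration order produces the same final value.

```python
import random


class MaxReduction:
    """Hypothetical reduction variable: keeps the largest value seen."""

    def __init__(self):
        self.value = None

    def accumulate(self, x):
        # max is associative and commutative, so update order is irrelevant
        self.value = x if self.value is None else max(self.value, x)


def reduce_in_order(values):
    """Apply the self-updates in the given order and return the result."""
    acc = MaxReduction()
    for v in values:
        acc.accumulate(v)
    return acc.value


data = [7, 3, 9, 1, 4]
shuffled = data[:]
random.Random(0).shuffle(shuffled)  # a different iteration order
```

Calling `reduce_in_order` on `data` and on `shuffled` gives the same value,
which is the property a foreach loop relies on.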

2.2 Satellite Data Processing Example 



interface Reducinterface {
    // Any object of any class implementing
    // this interface is a reduction variable
}

public class pixel {
    short bands[5];
    short geo[2];
}

public class block {
    pixel bands[204*204];
    pixel getData(Point[2] p) {
        /* Search for the (lat, long) on geo data */
        /* Return the pixel if it exists */
        /* Else return null */
    }
}

public class SatData {
    public class SatOrigData {
        block[1d] satorigdata;
        void SatOrigData(RectDomain[1] InputDomain) {
            satorigdata = new block[InputDomain];
        }
        pixel getData(Point[3] q) {
            Point[1] time = (q.get(0));
            Point[2] p = (q.get(1), q.get(2));
            return satorigdata[time].getData(p);
        }
    }
    void SatData(RectDomain[1] InputDomain) {
        SatOrigData(InputDomain);
    }
    pixel getData(Point[3] q) {
        return SatOrigData(q);
    }
}

public class Image implements Reducinterface {
    void Accumulate(pixel input) {
        /* Accumulation function */
    }
}

public class Satellite {
    Point[1] LoEnd = ...
    Point[1] HiEnd = ...
    SatData satdata;
    RectDomain[1] InputDomain = [LoEnd : HiEnd];
    satdata.SatData(InputDomain);
    public static void main(int[] args) {
        Point[1] lowtime = (args[0]);
        Point[1] hightime = (args[1]);
        RectDomain[1] TimeDomain = [lowtime : hightime];
        Point[2] lowend = (args[2], args[4]);
        Point[2] highend = (args[3], args[5]);
        RectDomain[2] OutputDomain = [lowend : highend];
        Point[3] low = (args[0], args[2], args[4]);
        Point[3] high = (args[1], args[3], args[5]);
        RectDomain[3] AbsDomain = [low : high];
        Image[2d] Output = new Image[OutputDomain];
        foreach (Point[3] q in AbsDomain) {
            if (pixel val = satdata.getData(q)) {
                Point[2] p = (q.get(1), q.get(2));
                Output[p].Accumulate(val);
            }
        }
    }
}



Fig. 1. A Satellite Data-Processing Code 



In Figure 1, we show the essential structure associated with the satellite
data processing application. The satellites generating the datasets contain
sensors for five different bands. The measurements produced by the satellite are
short values (16 bits) for each band. As the satellite orbits the earth, the sensors
sweep the surface building scan lines of 408 measurements each. Our data file
consists of blocks of 204 half scan lines, which means that each block is a 204 x 204
array with 5 short integers per element. Latitude and longitude are also stored
within the disk block for each measurement.

The typical computation on this satellite data is as follows. A portion of 
earth is specified through latitudes and longitudes of end points. A time range 
(typically 10 days to one year) is also specified. For any point on the earth within 
the specified area, all available pixels within that time-period are scanned and the 
best value is determined. A typical criterion for finding the best value is
cloudiness on that day, with the least cloudy image being the best. The best
pixel over each point within the area is used to produce a composite image.
This composite image is used by researchers to study a number of properties,
like deforestation over time, pollution over different areas, etc.

The main source of irregularity in this dataset and computation arises
because the earth is spherical, whereas the satellite sees the area of earth it
is above as a rectangular grid. Thus, the translation from the rectangular area
that the satellite has captured in a given band to latitudes and longitudes is
not straightforward.

We next explain the data-parallel Java code representing such computation
(Figure 1). The class block represents the data captured in each time-unit
by the satellite. This class has one function (getData) which takes a (latitude, 
longitude) pair and sees if there is any pixel in the given block for that location. 
If so, it returns that pixel. The class SatData is the interface to the input dataset 
visible to the programmer writing the main execution loop. Through its access 
function getData, this class gives the view that a 3-dimensional grid of pixels is 
available. Encapsulated inside this class is the class SatOrigData, which stores 
the data as a 1-dimensional array of blocks. The constructor and the access
function of the class SatData invoke the constructor and the access function,
respectively, of the class SatOrigData.

The main processing function takes 6 command line arguments as the input. 
The first two specify a time range over which the processing is performed. The 
next four are the latitudes and longitudes for the two end-points of the rectan- 
gular output desired. We consider an abstract 3-dimensional rectangular grid, 
with time, latitude, and longitude as the three axes. This grid is abstract, be- 
cause pixels actually exist for only a small fraction of all the points in this grid. 
However, the high-level code just iterates over this grid in the foreach loop. 
For each point q in the grid, which is a (time, lat, long) tuple, we examine
if the block SatData[time] has any pixel. If such a pixel exists, it is used for
performing a reduction operation on the object Output[(lat, long)].

The code, as specified above, can lead to very inefficient execution for at least 
two reasons. First, if the look-up is performed for every point in the abstract 
grid, it will have a very high overhead. Second, if the order of iterations in the 
loop is not carefully chosen, the locality can be very poor. 

3 Compilation Model for Applications with Spatial
Coordinates 

The main challenge in executing a data intensive loop comes from the fact that 
the amount of data accessed in the loop exceeds the main memory. While the 
virtual memory support can be used for correct execution, it leads to very poor 
performance. Therefore, it is the compiler's responsibility to perform memory
management, i.e., determine which portions of output and input collections are in
the main memory during a particular stage of the computation. 

Based upon the experiences from data intensive applications and developing
runtime support for them, the basic code execution scheme we use is as
follows. The output data-structure is divided into tiles, such that each tile fits
into the main memory. The input dataset is read one disk block at a time. This
is because the disks provide the highest bandwidth and incur the lowest overhead
while accessing all data from a single disk block. Once an input disk block is
brought into main memory, all iterations of the loop which read from this disk
block and update an element from the current tile are performed. A tile from
the output data-structure is never allocated more than once, but a particular
disk block may be read more than once to contribute to multiple output tiles.

To facilitate the execution of loops in this fashion, our compiler first performs 
loop fission. For each resulting loop after loop fission, it uses the runtime system 
called Active Data Repository (ADR), developed at the University of Maryland,
to retrieve data and stage the computation.

3.1 Loop Fission 

Consider any loop. For the purpose of our discussion, collections of objects whose 
elements are modified in the loop are referred to as left hand side or lhs collec- 
tions, and the collections whose elements are only read in the loop are considered 
as right hand side or rhs collections. 

If multiple distinct subscript functions are used to access the right-hand-side 
(rhs) collections and left-hand-side (lhs) collections and these subscript func- 
tions are not known at compile-time, tiling output and managing disk accesses 
while maintaining high reuse and locality is going to be difficult. In particular,
the current implementation of the ADR runtime support requires a single distinct
RHS subscript function and a single distinct LHS subscript function. Therefore,
we perform loop fission to divide the original loop into a set of loops, such that 
all LHS collections in any new loop are accessed with the same subscript function 
and all RHS collections are also accessed with the same subscript function. 
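
As a rough illustration of this grouping, the following Python sketch splits a
loop body into one group per (LHS subscript, RHS subscript) pair. The tuple
representation of a statement and the function name `fission` are assumptions
made for this example; the sketch also assumes that the statements are
independent reduction updates, so that reordering them is legal.

```python
from collections import OrderedDict


def fission(statements):
    """Group loop-body statements by their (LHS subscript, RHS subscript) pair.

    Each statement is a tuple (text, lhs_subscript, rhs_subscript); each
    returned group stands for the body of one loop after fission.
    """
    loops = OrderedDict()
    for text, lhs_sub, rhs_sub in statements:
        loops.setdefault((lhs_sub, rhs_sub), []).append(text)
    return list(loops.values())


# Hypothetical loop body: two statements share subscripts f and g,
# one statement uses a different LHS subscript h.
body = [
    ("O1[f(r)] += I1[g(r)]", "f", "g"),
    ("O2[h(r)] += I1[g(r)]", "h", "g"),
    ("O3[f(r)] += I2[g(r)]", "f", "g"),
]
loops = fission(body)
```

Here the original loop is divided into two loops: one containing the updates
of `O1` and `O3`, and one containing the update of `O2`.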

The terminology presented here is illustrated by the example loop in Figure 2.
The range (domain) over which the iterator iterates is denoted by $\mathcal{R}$.
Let there be $n$ RHS collections of objects read in this loop, which are denoted
by $I_1, \ldots, I_n$. Similarly, let the LHS collections written in the loop be
denoted by $O_1, \ldots, O_m$. Further, we denote the subscript function used for
accessing right hand side collections by $S_R$ and the subscript function used
for accessing left hand side collections by $S_L$.

foreach ($r \in \mathcal{R}$) {
    $O_1[S_L(r)] \; op_1{=} \; A_1(I_1[S_R(r)], \ldots, I_n[S_R(r)])$
    $\ldots$
    $O_m[S_L(r)] \; op_m{=} \; A_m(I_1[S_R(r)], \ldots, I_n[S_R(r)])$
}



Fig. 2. A Loop In Canonical Form After Loop Fission 



Given a point $r$ in the range for the loop, elements $S_L(r)$ of the lhs
collections are updated using one or more of the values
$I_1[S_R(r)], \ldots, I_n[S_R(r)]$, and other scalar values in the program. We
denote by $A_i$ the function used for updating LHS collection $O_i$.

Consider any element of a RHS or lhs collection. Its abstract address is 
referred to as its l-value and its actual value is referred to as its r-value. 

3.2 Extracting Information 

Our compiler extracts the following information from a given data-parallel loop 
after loop fission. 

1. We extract the range $\mathcal{R}$ of the loop by examining the domain over
which the loop iterates.

2. We extract the accumulation functions used to update the lhs collections
in the loop. For extracting the function $A_i$, we look at the statement in the
loop where the lhs collection $O_i$ is modified. We use interprocedural program
slicing, with this program point and the value of the element modified as
the slicing criterion.

3. For a given element of the RHS collection (with its l-value and r-value), we
determine the iteration(s) of the loop in which it can be accessed. Consider
the example code in Figure 1. Consider a pixel with the l-value
$\langle t, num \rangle$, i.e., it is the $num$-th pixel in the SatData[t]
block. Suppose its r-value is
$\langle c1, c2, c3, c4, c5, lat, long \rangle$. From the code, it can be
determined that this element will be and can only be accessed in the iteration
$\langle t, lat, long \rangle$.

Formally, we denote it as 

$IterVal(e = (\langle t, num \rangle; \langle c1, c2, c3, c4, c5, lat, long \rangle)) = \langle t, lat, long \rangle$

While this information can be extracted easily from the loop in Figure 1,
computing such information from a loop is a hard problem in general and a
subject of future research.

4. For a given element of the RHS collection (with its l-value and r-value), we
determine the l-value of the lhs element which is updated using its value. For
example, for the loop in Figure 1, for a RHS element with l-value
$\langle t, num \rangle$ and r-value
$\langle c1, c2, c3, c4, c5, lat, long \rangle$, the l-value of the LHS element
updated using this is $\langle lat, long \rangle$.

Formally, we denote it as

$OutVal(e = (\langle t, num \rangle; \langle c1, c2, c3, c4, c5, lat, long \rangle)) = \langle lat, long \rangle$

This can be extracted by composing the subscript function for the lhs col- 
lections with the function IterVal. 
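
The two mappings for the satellite example can be sketched in Python as
follows. The function names and the tuple layout of the l-value and r-value
are assumptions chosen to mirror the notation above; the point of the sketch
is that OutVal is obtained by composing the LHS subscript function with
IterVal.

```python
def iter_val(l_value, r_value):
    """Map an RHS pixel to the single iteration that can access it."""
    t, _num = l_value                     # l-value: <t, num>
    lat, lon = r_value[5], r_value[6]     # r-value: <c1..c5, lat, long>
    return (t, lat, lon)


def lhs_subscript(iteration):
    """LHS subscript function of the example loop: drop the time axis."""
    _t, lat, lon = iteration
    return (lat, lon)


def out_val(l_value, r_value):
    """OutVal is the LHS subscript function composed with IterVal."""
    return lhs_subscript(iter_val(l_value, r_value))


# Hypothetical pixel: 340th element of the block for time 12.
pixel_l = (12, 340)
pixel_r = (10, 20, 30, 40, 50, 45, -73)
```

For this pixel, `iter_val` yields the iteration `(12, 45, -73)` and `out_val`
yields the output location `(45, -73)`.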

3.3 Storing Spatial Information 

To facilitate decisions about which disk blocks have elements that can contribute 
to a particular tile, the system stores additional information about each disk 
block in the meta-data associated with the dataset. Consider a disk block b 
which contains a number of elements. Explicitly storing the spatial coordinates 
associated with each of the elements as part of the meta-data will require very 
large additional storage and is clearly not practical. Instead, the range of the 
spatial coordinates of the elements in a disk block is described by a bounding 
box. 

A bounding box for a disk block is the minimal rectilinear section (described
by the coordinates of the two extreme end-points) such that the spatial
coordinates of each element in the disk block fall within this rectilinear section.

Such bounding boxes can be computed and stored with the meta-data during 
a preprocessing phase when the data is distributed between the processors and 
disks. 
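
A minimal sketch of this preprocessing step, assuming the spatial coordinates
of a block's elements are available as equal-length tuples (the function name
`bounding_box` and the sample coordinates are illustrative only):

```python
def bounding_box(coords):
    """Return the (low, high) corner points of the minimal rectilinear
    section that contains every coordinate tuple in coords."""
    dims = range(len(coords[0]))
    low = tuple(min(c[d] for c in coords) for d in dims)
    high = tuple(max(c[d] for c in coords) for d in dims)
    return low, high


# Hypothetical (lat, long) coordinates of the elements in one disk block.
block_coords = [(10, 40), (12, 38), (11, 45)]
bb = bounding_box(block_coords)
```

Only the two corner points `((10, 38), (12, 45))` need to be stored in the
meta-data, regardless of how many elements the block holds.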

3.4 Loop Planning 

The following decisions need to be made during the loop planning phase:

— The size of lhs collection required on each processor and how it is tiled.

— The set of disk blocks from the RHS collection which need to be read for each
tile on each processor.

The static declarations on the lhs collection can be used to decide the total 
size of the output required. Not all elements of this lhs space need to be updated 
on all processors. However, in the absence of any other analysis, we can simply 
replicate the lhs collections on all processors and perform global reduction after 
local reductions on all processors have been completed. 

The memory requirement of the replicated lhs space is typically higher
than the available memory on each processor. Therefore, we need to divide the
replicated lhs buffer into chunks that can be allocated in each processor's main
memory. We have so far used only a very simple strip mining strategy. We query
the run-time system to determine the available memory that can be allocated
on a given processor. Then, we divide the lhs space into blocks of that size.
Formally, we divide the lhs domain into a set of smaller domains (called strips)
$\{S_1, S_2, \ldots, S_r\}$. Since each of the lhs collections of objects in the
loop is accessed through the same subscript function, the same strip mining is
used for each of them.
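
The strip mining step might look as follows in Python. This is only a sketch
under stated assumptions: the lhs domain is two-dimensional, strips are made
of whole rows, and the element size and memory budget used in the example are
hypothetical values rather than figures from our system.

```python
def strip_mine(n_rows, n_cols, bytes_per_element, available_bytes):
    """Split a 2-D lhs domain into row strips that each fit in memory.

    Returns a list of (first_row, last_row) pairs, both inclusive.
    """
    row_bytes = n_cols * bytes_per_element
    rows_per_strip = available_bytes // row_bytes
    if rows_per_strip == 0:
        raise ValueError("a single row does not fit in the available memory")
    return [(lo, min(lo + rows_per_strip, n_rows) - 1)
            for lo in range(0, n_rows, rows_per_strip)]


# A 313 x 313 output with room for 100 rows of 8-byte elements at a time.
strips = strip_mine(n_rows=313, n_cols=313, bytes_per_element=8,
                    available_bytes=100 * 313 * 8)
```

With these numbers the domain is divided into four strips, the last of which
holds the remaining 13 rows.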

We next decide which disk blocks need to be read for performing the updates
on a given tile. A lhs tile is allocated only once. If elements of a particular disk 
block are required for updating multiple tiles, this disk block is read more than 
once. 

On a processor $j$ and for a given rhs collection $I_i$, the bounding box of
spatial coordinates for the disk block $b_{ijk}$ is denoted by $BB(b_{ijk})$. On
a given processor $j$ and for a given lhs strip $l$, the set of disk blocks which
need to be read is denoted by $L_{jl}$. These sets are computed as follows:

$L_{jl} = \{ k \mid (BB(b_{ijk}) \cap S_l) \neq \emptyset \}$
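
The selection of disk blocks for a strip can be sketched in Python as a
box-intersection test. The sketch assumes that both the bounding boxes and the
strip are given as (low, high) corner pairs in the coordinate space of the lhs
collection; the function names and sample boxes are illustrative.

```python
def intersects(box_a, box_b):
    """True if two rectilinear boxes, given as (low, high) corners, overlap."""
    (a_lo, a_hi), (b_lo, b_hi) = box_a, box_b
    return all(al <= bh and bl <= ah
               for al, ah, bl, bh in zip(a_lo, a_hi, b_lo, b_hi))


def blocks_for_strip(block_boxes, strip_box):
    """Indices k of the disk blocks whose bounding box intersects the strip."""
    return [k for k, bb in enumerate(block_boxes) if intersects(bb, strip_box)]


# Three hypothetical disk blocks and one strip of the lhs domain.
block_boxes = [((0, 0), (10, 10)), ((20, 0), (30, 10)), ((5, 5), (25, 8))]
strip = ((0, 0), (15, 15))
needed = blocks_for_strip(block_boxes, strip)
```

Blocks 0 and 2 overlap the strip and must be read; block 1 lies entirely
outside it and is skipped for this strip.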



3.5 Loop Execution 

The generic loop execution strategy is shown in Figure 3. The lhs tiles are
allocated one at a time. For each lhs tile $S_l$, the rhs disk blocks from the set
$L_{jl}$ are read successively. For each element $e$ from a disk block, we need to
determine:

1. If this element is accessed in one of the iterations of the loop. If so, we need 
to know which iteration it is. 

2. The LHS element that this RHS element will contribute to, and if this lhs 
element belongs to the tile currently being processed. 

We use the function $IterVal(e)$ computed earlier to map the rhs element to
the iteration number and the function $OutVal(e)$ to map a RHS element to the
LHS element.



For each lhs strip $S_l$:
    Execute on each Processor $P_j$:
        Allocate and initialize strip $S_l$ for $O_1, \ldots, O_m$
        Foreach $k \in L_{jl}$
            Read blocks $b_{ijk}$, $i = 1, \ldots, n$ from disks
            Foreach element $e$ in $b_{ijk}$
                $i = IterVal(e)$
                $o = OutVal(e)$
                If $(i \in \mathcal{R}) \wedge (o \in S_l)$
                    Evaluate functions $A_1, \ldots, A_m$
        Global reduction to finalize the values for $S_l$



Fig. 3. Loop Execution on Each Processor 



Though the above execution sequence achieves very high locality in disk
accesses, it performs considerably more computation than the original loop.
This is because the mapping from the element e to the iteration number and the
LHS element needs to be evaluated and intersected with the range and tile in every
iteration. In the next section, we describe a technique for hoisting conditional
statements which eliminates almost all of the additional computation associated
with the code shown in Figure 3.
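
To make the source of the extra work concrete, the following Python sketch
mimics the execution sequence on a single processor with one RHS collection.
It is an illustration under simplifying assumptions, not the generated code:
strips are plain sets of output coordinates, every block is scanned for every
strip instead of only the selected ones, and all function and parameter names
are hypothetical.

```python
def execute_loop(strips, blocks, loop_range, iter_val, out_val, accumulate):
    """Single-processor sketch of the strip-by-strip loop execution.

    strips: list of sets of lhs coordinates (the tiles)
    blocks: list of disk blocks, each a list of elements
    loop_range: predicate telling whether an iteration lies in the range
    """
    output = {}
    for strip in strips:
        tile = {}                          # allocate the current lhs tile
        for block in blocks:               # a real system reads selected blocks only
            for e in block:
                i = iter_val(e)
                o = out_val(e)
                # this test is evaluated for every element of every block
                if loop_range(i) and o in strip:
                    tile[o] = accumulate(tile.get(o), e)
        output.update(tile)                # stands in for the global reduction
    return output


# Elements are (t, lat, long, value); keep the largest value per (lat, long).
blocks = [[(0, 1, 1, 5), (0, 1, 2, 7)], [(1, 1, 1, 9), (5, 2, 2, 4)]]
strips = [{(1, 1), (1, 2)}, {(2, 1), (2, 2)}]
result = execute_loop(
    strips, blocks,
    loop_range=lambda i: 0 <= i[0] <= 1,
    iter_val=lambda e: (e[0], e[1], e[2]),
    out_val=lambda e: (e[1], e[2]),
    accumulate=lambda old, e: e[3] if old is None else max(old, e[3]),
)
```

The element with time 5 falls outside the range and contributes nothing, yet
its mappings and the conditional are still evaluated once per strip.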

4 Code Motion for Conditional Statements 

In this section, we present a technique which eliminates redundant conditional 
statements and merges the conditionals with loop headers or other conditionals 
wherever possible. 

Program Representation. We consider only structured control flow, with if
statements and loops. Within each control level, the definitions and uses of variables
are linked together with def-use links. Since we are looking at def-use links within 
a single control level, each use of a variable is linked with at most one definition. 

Candidates for Code Motion. The candidates for code motion in our framework
are if statements. One common restriction in code hoisting frameworks like
Partial Redundancy Elimination (PRE) and the existing work on code hoisting
for conditionals is that syntactically different expressions which may have
the same value are considered different candidates. We remove this restriction
by following def-use links and considering multiple views of the expressions in
conditionals and loops.

To motivate this, consider two conditional statements, one of which is
enclosed in the other. Let the outer condition be x > 2 and let the inner condition
be y > 3. Syntactically, these are different expressions and therefore, it appears
that both of them must be evaluated. However, by seeing the definitions of x
and y that reach these conditional statements, we may be able to relate them.
Suppose that x is defined as x = z - 3 and y is defined as y = z - 2. By
substituting the definitions of x and y in the expressions, the conditions become
z - 3 > 2 and z - 2 > 3, which are identical once both are simplified to z > 5.
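
The comparison in this example can be sketched in Python for the restricted
case of conditions of the form "variable > constant" and definitions of the
form "variable = base + offset". The representation and the function name
`substitute` are assumptions for illustration; a real implementation would
work on general expressions.

```python
def substitute(cond, definitions):
    """Forward-substitute linear definitions into a condition 'var > bound'.

    cond is (var, bound); definitions maps var -> (base_var, offset), meaning
    var = base_var + offset. The result is normalized to 'base_var > bound'.
    Definitions are assumed to be acyclic.
    """
    var, bound = cond
    while var in definitions:
        base, offset = definitions[var]
        # base + offset > bound  is the same as  base > bound - offset
        var, bound = base, bound - offset
    return (var, bound)


defs = {"x": ("z", -3), "y": ("z", -2)}
outer = substitute(("x", 2), defs)   # x > 2 with x = z - 3
inner = substitute(("y", 3), defs)   # y > 3 with y = z - 2
```

Both conditions normalize to the same pair `("z", 5)`, so the inner
conditional is implied by the outer one.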

We define a view of a candidate for code motion as follows. Starting from 
the conditional, we do a forward substitution of the definition of zero or more 
variables occurring in the conditional. This process may be repeated if new vari- 
ables are introduced in the expression after forward substitution is performed. 
Each distinct subset of the set of all possible forward substitutions
yields a distinct view of the candidate. Since we are considering
def-use within a single control level, there is at most one reaching definition of a 
use of a variable. This significantly simplifies the forward substitution process. 
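
The enumeration of views can be sketched in Python by treating a condition as
a tuple of tokens and repeatedly substituting definitions. The token
representation and the function name `views` are assumptions for this
illustration, and the definitions are assumed to be acyclic.

```python
def views(cond, definitions):
    """All views of a condition reachable by forward substitution.

    cond is a tuple of tokens; definitions maps a variable name to the tuple
    of tokens defining it (at most one reaching definition per variable).
    """
    seen = {cond}
    work = [cond]
    while work:
        current = work.pop()
        for var in set(current) & set(definitions):
            replacement = ("(",) + definitions[var] + (")",)
            new = tuple(tok for t in current
                        for tok in (replacement if t == var else (t,)))
            if new not in seen:
                seen.add(new)
                work.append(new)
    return seen


defs = {"x": ("z", "-", "3"), "y": ("z", "-", "2")}
all_views = views(("x", ">", "y"), defs)
```

With two substitutable variables, the condition `x > y` has four views: the
original, one with only `x` replaced, one with only `y` replaced, and one with
both replaced.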

Views of a loop header are created in a similar fashion. Forward substitution 
is not done for any variable, including the induction variable, which may be 
modified in the loop. 

Phase I: Downward Propagation. In the first phase, we propagate dominating
constraints down the levels, and eliminate any conditional which is found to be
redundant. Consider any loop header or conditional statement. The range of the loop
or the condition imposes a constraint on the values of variables or expressions in the
control blocks enclosed within. As described previously, we compute the different
views of the constraints by performing a different set of forward substitutions. By
composing the different views of the loop headers and conditional statements,
we get different views of the dominating constraints.

Consider any conditional statement for which the different views of the dom- 
inating constraints are available. By comparing the different views of this con- 
ditional with the different views of dominating constraints, we determine if this 
conditional is redundant. A redundant conditional is simply removed and the
statements enclosed inside it are merged with the control block in which the 
conditional statement was initially placed. 

Phase II: Upward Propagation. After the redundant conditionals have been
eliminated, we consider if any of the conditionals can be folded into any of the
conditionals or loops enclosing it. The following steps are used in the process. We
compute all the views of the conditional which is the candidate for hoisting. 

Consider any statement which dominates the conditional. We compute two 
terms: anticipability of the candidate at that statement, and anticipable views.
The candidate is anticipable at its original location and all views of the candidate 
computed originally are anticipable views. 

The candidate is considered anticipable at the beginning of a statement if 
it is anticipable at the end of the statement and any assignment made in the 
statement is not live at the end of the conditional. The reason behind this
condition is as follows. A statement can be folded inside the conditional only if 
the values computed in it are used inside the conditional only. To compute the 
set of anticipable views at the beginning of a statement, we consider two cases: 

Case 1. If the variable assigned in the statement does not influence the expres- 
sion inside the conditional, all the views anticipable at the end of the 
statement are anticipable at the beginning of the statement. 

Case 2. Otherwise, let the variable assigned in this statement be v. From the set 
of views anticipable at the end of the statement, we exclude the views 
in which the definition of v at this statement is not forward substituted. 

Now, consider any conditional or loop which encloses the original candidate 
for placement, and let this candidate be anticipable at the beginning of the first 
statement enclosed in the conditional or loop. We compare all the views of this 
conditional or loop against all anticipable views of the candidate for placement. 
If either the left-hand sides or the right-hand sides of the expressions are
identical or differ by a constant, we fold the candidate into this conditional or
loop.
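
As a simple illustration of folding, consider a loop over an integer range
whose body is guarded by a lower-bound test on the loop variable. The Python
sketch below is a hypothetical special case (the helper names are not part of
our framework): the test "i > bound" is absorbed into the loop header by
raising the lower limit.

```python
def fold_lower_bound(loop_lo, loop_hi, cond_bound):
    """Fold 'if i > cond_bound' into 'for i in range(loop_lo, loop_hi)'."""
    return max(loop_lo, cond_bound + 1), loop_hi


def original(lo, hi, bound):
    """Iterations that execute the guarded body in the original loop."""
    return [i for i in range(lo, hi) if i > bound]


def folded(lo, hi, bound):
    """Iterations of the loop after the conditional has been folded in."""
    new_lo, new_hi = fold_lower_bound(lo, hi, bound)
    return list(range(new_lo, new_hi))
```

Both versions execute the body for the same values of `i`, but the folded loop
never evaluates the conditional.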

5 Experimental Results 

In this section we present results from the experiments we conducted to
demonstrate the effectiveness of our execution model. We also present preliminary
evidence of the benefits from conditional hoisting optimization. We used a cluster
of 400 MHz Pentium II based computers connected by a gigabit switch. Each
node has 256 MB of main memory and 18 GB of local disk. We ran experiments 
using 1, 2, 4 and 8 nodes of the cluster. 

The application we use for our experiments closely matches the code
presented in Figure 1 and is referred to as sat in this section. We generated code
for the satellite template using the compilation strategy described in this paper.
This version is referred to as the sat-comp version. We also had access to a version
of the code developed by customizing the runtime system Active Data
Repository (ADR) by hand. This version is referred to as the sat-manual version. We
further created two more versions. The version sat-opt has the code hoisting
optimization applied by hand. The version sat-naive is created to measure the
performance using a naive compilation strategy. This naive compilation strategy
is based upon an execution model we used in our earlier work for regular
codes.

The data for the satellite application we used is approximately 2.7 gigabytes. 
This corresponds to part of the data generated over a period of 2 months, and 
only contains data for bands 1 and 2, out of the 5 available for the particular 
satellite. The data spans the entire surface of the planet over that period of
time. The processing performed by the application consists of generating a com- 
posite image of the earth approximately from latitude 0 to latitude 78 north 
and from longitude 0 to longitude 78 east over the entire 2 month period. This 
involves composing over about 1/8 of the available data and represents an area 
that covers almost all of Europe, northern Africa, the Middle East and almost 
half of Asia. The output of the application is a 313 x 313 picture of the surface 
for the corresponding region. 

Figure 4 compares three versions of sat: sat-comp, sat-opt and sat-manual.
The difference between the execution times of sat-comp and sat-opt shows the 
impact of eliminating redundant conditionals and hoisting others. The improve- 
ment in the execution time by performing this optimization is consistently be- 
tween 29% and 31% on 1, 2, 4, and 8 processor configurations. This is a significant 
improvement considering that a relatively simple optimization is applied. 

The sat-opt version is the best performance we expect from the compilation 
technique we have. Comparing the execution times from this version against a
hand generated version shows us how close the compiler generated version can be
to hand customization. The versions sat-opt and sat-manual are significantly
different in terms of the implementation. The hand customized code has been 
carefully optimized to avoid all unnecessary computations by only traversing 
the parts of each disk block that are effectively part of the output. The compiler 
generated code will traverse all the points of the data blocks. However, our 
proposed optimization is effective in hoisting the conditionals within the loop 
to the outside, therefore minimizing that extra computation. Our experiments 
show that after optimizations, the compiler is consistently around 18 to 20% 
slower than the hand customized version. 

Figure 5 shows the execution time if the execution strategy used for regular
data intensive applications is applied for this code (the sat-naive version).



Fig. 4. Satellite: Comparing sat-comp, sat-opt, and sat-manual versions 




Fig. 5. Performance of the naive implementation of the satellite template 



In this strategy, each input disk block is read like the strategy proposed in 
this paper. However, rather than iterating over the elements and mapping each 
element to an iteration of the loop, the bounding box of the disk block is mapped 
into a portion of the iteration space. Then the code is executed for this iteration 
space. 

As can be seen from Figure 5, the performance of this version is very poor 
and the execution times are almost two orders of magnitude higher than the 
other versions. The reason is that this code iterates over a very large iteration 
space for each disk block and checks whether or not there is input data for each 
point in the domain. Due to the nature of the problem, the blocks towards the 
poles of the planet will span a very large area over the globe, which leads to a 
huge number of iterations. Clearly, the execution strategy for regular codes is 
not applicable for an application like sat. 
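
The gap between the two strategies can be seen on a toy example. The C++ sketch below is only an illustration under simplifying assumptions (a one-dimensional grid, a hypothetical Element type, hypothetical function names): the first function scans the whole bounding box of a block and searches for data at every point, while the second iterates over the elements that are present and maps each one to its iteration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sparse input element: a grid coordinate and a value.
struct Element { int x; double value; };

// Naive strategy: scan every point of the block's bounding box [lo, hi]
// and search for input data at that point. Returns the number of
// iteration-space points visited.
std::size_t accumulate_naive(const std::vector<Element>& block, int lo, int hi,
                             std::vector<double>& out) {
    std::size_t visited = 0;
    for (int x = lo; x <= hi; ++x) {
        ++visited;
        for (const Element& e : block)
            if (e.x == x) out[x] += e.value;
    }
    return visited;
}

// Element-driven strategy: iterate over the elements actually present and
// map each one to its iteration (here, simply its coordinate).
std::size_t accumulate_by_element(const std::vector<Element>& block,
                                  std::vector<double>& out) {
    std::size_t visited = 0;
    for (const Element& e : block) {
        ++visited;
        out[e.x] += e.value;
    }
    return visited;
}
```

For a block holding two elements whose coordinates are far apart, the naive version visits every point in between, whereas the element-driven version visits exactly two.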
6 Related Work 

The work presented in this paper is part of our continuing work on compiling data 
intensive applications. Our previous work did not handle applications with 
sparse accesses and datasets with spatial coordinates. This paper has two main 
original contributions beyond our previous publications. First, we have presented 
a new execution strategy for sparse accesses which is significantly different from 
the execution strategy for dense accesses presented previously. Second, we 
have presented a technique for code motion of conditionals. 

Our code execution model has several similarities to the data-centric locality 
transformations proposed by Pingali et al. We fetch a data-chunk or 
shackle from a lower level in the memory hierarchy and perform the iterations 
from the loop which use elements from this data-chunk. We have focused on 
applications where no computations may be performed as part of many itera- 
tions from the original loop. So, instead of following the same loop pattern and 
inserting conditionals to see if the data accessed in the iteration belongs to the 
current data-chunk, we compute a mapping function from elements to iterations 
and iterate over the data elements. To facilitate this, we simplify the problem by 
performing loop fission, so that all collections on the right-hand-side are accessed 
with the same subscript function. 

Several other researchers have focused on removing fully or partially redundant 
conditionals from code. Mueller and Whalley have proposed analysis within 
a single loop-nest and Bodik, Gupta, and Soffa perform demand-driven 
interprocedural analysis. Our method is more aggressive in the sense that we associate 
the definitions of variables involved in the conditionals and loop headers. This 
allows us to consider conditionals that are syntactically different. Our method 
is also more restricted than these previously proposed approaches in the sense 
that we do not consider partially redundant conditionals and do not restructure 
the control flow to eliminate more conditionals. Many other researchers have 
presented techniques to detect the equality or implies relationship between 
conditionals, which are powerful enough to take care of syntactic differences between 
expressions. 

Our work on providing high-level support for data intensive computing can 
be considered as developing an out-of-core Java compiler. Compiler optimiza- 
tions for improving I/O accesses have been considered by several projects. The 
PASSION project at Northwestern University has considered several different 
optimizations for improving locality in out-of-core applications. Some of 
these optimizations have also been implemented as part of the Fortran D 
compilation system’s support for out-of-core applications. Mowry et al. have 
shown how a compiler can generate prefetching hints for improving the 
performance of a virtual memory system. These projects have concentrated on 
relatively simple stencil computations written in Fortran. Besides the use of an 
object-oriented language, our work is significantly different in the class of appli- 
cations we focus on. Our technique for loop execution is particularly targeted 
towards reduction operations, whereas previous work has concentrated on stencil 
computations. 
7 Conclusions and Future Work 

Processing and analyzing large volumes of data plays an increasingly important 
role in many domains of scientific research. We have developed a compiler which 
processes data intensive applications written in a dialect of Java and compiles 
them for efficient execution on clusters of workstations or distributed memory 
machines. In this paper, we focus on data intensive applications with two im- 
portant properties: 1) data elements have spatial coordinates associated with 
them and the distribution of the data is not regular with respect to these coor- 
dinates, and 2) the application processes only a subset of the available data on 
the basis of spatial coordinates. We have presented a general compilation model 
for this class of applications which achieves high locality in disk accesses. We 
have also outlined a technique for hoisting conditionals and removing redundant 
conditionals that further achieves efficiency in execution of such compiled codes. 

Our preliminary experimental results show that the performance from our 
proposed execution strategy is nearly two orders of magnitude better than a 
naive strategy. Further, up to 30% improvement in performance is observed by 
applying the technique for hoisting conditionals. 
Abstract. In translating HPF programs, a compiler has to generate 
local iteration and communication sets. Apart from local enumeration, 
local storage compression is an issue, because in HPF array alignment 
functions can introduce local storage inefficiencies. Storage compression, 
however, may not lead to serious performance penalties. A problem in 
semi-automatic translation is that a compiler should generate efficient 
code in all cases the user may expect efficient translation (no surprises). 
However, in current compilers this turns out to be not always true. A 
major cause of these inefficiencies is that compilers use the same fixed 
enumeration scheme in all cases. In this paper, we present an efficient 
dynamic local enumeration method, which always selects the optimal 
solution at run-time and has no need for code duplication. The method 
is compared with the PGI and the Adaptor compiler. 



Dynamic Selection of Enumeration Orders 

Once a data mapping function is given in an HPF program, we know exactly 
which element of an array is owned by which processor. However, the storage of 
all elements owned by a single processor still needs to be determined. When a 
compiler is to generate code for local iteration or communication sets, it needs to 
enumerate the local elements efficiently, in the sense that only a small 
overhead is allowed compared to the work inside the iteration space. Because in 
both phases local elements need to be referenced, storage and enumeration are 
closely linked to each other. An overview of the basic schemes for local storage 
and local enumeration is given in P|. 

In a cyclic(m) distribution the template data elements are distributed in 
blocks of size m in a round robin fashion. The relation between an array 
index i and the row, column, and processor tuple (r,c,p) is given by the position 
equation. To avoid inefficient memory storage, local storage is compressed 
by removing unused template elements. There are various compression 
techniques, each with their own compression factor for rows (Δr) and columns (Δc). 
For a cyclic(m) distribution, the original (normalized) volume assignment is 
transformed into a two-deep loop nest. The outer loop enumerates the global 
indices of the starting points of the rows (order=row_wise) or the columns 
(order=column_wise), depending on the enumeration order as specified by order. 
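
The relation between an index and its (r,c,p) tuple can be sketched in C++ as follows. This uses the standard block-cyclic decomposition with zero-based indices, which is assumed here to correspond to the position equation mentioned above; the names Position, locate, global_index and the parameter P (number of processors) are hypothetical.

```cpp
#include <cassert>

// Row, column and processor coordinates of a template element.
struct Position { long r, c, p; };

// Decompose a zero-based template index i under cyclic(m) over P processors:
// blocks of m consecutive elements are dealt out round-robin.
Position locate(long i, long m, long P) {
    Position pos;
    pos.c = i % m;        // offset inside the block (column)
    pos.p = (i / m) % P;  // processor that owns the block
    pos.r = i / (m * P);  // number of full rounds that precede it (row)
    return pos;
}

// Inverse mapping: rebuild the global index from (r, c, p).
long global_index(const Position& pos, long m, long P) {
    return (pos.r * P + pos.p) * m + pos.c;
}
```

With m = 4 and P = 3, index 13 lies in the second round (r = 1) on processor 0 at column 1, and the inverse mapping returns 13 again.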

S.P. Midkiff et al. (Eds.): LCPC 2000, LNCS 2017, pp. 355 ff., 2001. 
(c) Springer-Verlag Berlin Heidelberg 2001 

Will Denissen and Henk J. Sips 

In most compilers the order of enumeration is fixed. However, we have modified 
the method outlined in such that the generated code for both enumeration 
orders is identical, by adding a parameter ‘order’ to the run-time routines that 
determine the local loop nests. The only penalty is that a small test must be 
performed inside the run-time function that calculates the length of the inner loop 
for both enumeration orders. The test comprises the evaluation and comparison 
of the following expressions: order = row_wise if R_ext/Δr < C_ext/Δc and order 
= column_wise if R_ext/Δr > C_ext/Δc, where R_ext and C_ext are the number of 
rows and columns, respectively. 
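
A minimal C++ sketch of such a run-time test is given below. It is an illustration only: the function name, the Order type and the integer parameter types are assumptions, and the tie case (both quotients equal), which the text does not specify, is resolved here in favour of column-wise enumeration.

```cpp
#include <cassert>

enum class Order { RowWise, ColumnWise };

// Pick the enumeration order so that the longer loop becomes the inner loop.
// r_ext, c_ext: number of rows and columns; dr, dc: compression factors.
// Cross-multiplication avoids integer division (assumes dr > 0 and dc > 0).
Order select_order(long r_ext, long dr, long c_ext, long dc) {
    return (r_ext * dc < c_ext * dr) ? Order::RowWise : Order::ColumnWise;
}
```

A template with a single row and many columns selects row-wise enumeration, so that the single-trip loop ends up outermost; the opposite shape selects column-wise enumeration.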



Measurements 

We have implemented the modified method in the prototype HPF compiler 
developed in the Esprit PREPARE project [2]. The compiler can make an 
automatic choice between enumeration orders, but for comparison, we can also select 
one of the enumeration orders manually. We have compared the performance 
of our prototype compiler with the pghpf compiler from PGI and the Adaptor 
compiler from GMD. The parallel computer used is a Myrinet-based Pentium 
Pro cluster. All HPF compilers use the same MPI library. 

The local enumeration performance is tested using a simple distributed array 
initialization, i.e. A(i) = 1. The program template is shown below. To generate 
a test program from the template, the template name and the appropriate value 
of the D coordinate are needed. The test program can be derived by removing 
the !*! comment prefix from each line which contains the appropriate value of 
D. Lines containing a !*! comment prefix but none of the space coordinates can 
simply be removed. 

We are mainly interested in the performance when 1) different alignments, 
2) different kinds of templates, and 3) different enumerations (row-wise versus 
column-wise) are used. We distinguish four kinds of templates: D1: block 
distributed template, when there is only one row in the template; D2: block-like 
distributed template, when there is in the template only a small number of rows 
compared to the number of columns; D3: cyclic-like distributed template, when 
there is in the template only a small number of columns compared to the number 
of rows in the local data layout; D4: cyclic distributed template, when there is 
only one column in the template. 

PROGRAM init1D 
INTEGER, parameter :: N=100000,M=N/(4*NUMBER_OF_PROCESSORS()),imaxa=1000 
INTEGER :: i,iter,l,u,s 
REAL :: A(N),tim1,tim2,time_us 
!HPF$ PROCESSORS P(NUMBER_OF_PROCESSORS()) 
!D 1, 2, 3, 4!!HPF$ TEMPLATE T(N) 
!D 5, 6, 7, 8!!HPF$ TEMPLATE T(N+2) 
!D 9,10,11,12!!HPF$ TEMPLATE T(3*N+7) 
!D 1, 5, 9   !!HPF$ DISTRIBUTE T(BLOCK    ) ONTO P 
!D 2, 6,10   !!HPF$ DISTRIBUTE T(CYCLIC(M)) ONTO P ! block-like 
!D 3, 7,11   !!HPF$ DISTRIBUTE T(CYCLIC(4)) ONTO P ! cyclic-like 
!D 4, 8,12   !!HPF$ DISTRIBUTE T(CYCLIC   ) ONTO P 
!D 1, 2, 3, 4!!HPF$ ALIGN (i) WITH T(  i  ) :: A 
!D 5, 6, 7, 8!!HPF$ ALIGN (i) WITH T( i+2 ) :: A 
!D 9,10,11,12!!HPF$ ALIGN (i) WITH T(3*i+7) :: A 
tim1 = time_us() 
DO iter = 1, imaxa 
  CALL values(N,iter,l,u,s) 
  FORALL (i=l:u:s) 
    A(i) = 1.0 
  END FORALL 
END DO 
tim2 = time_us() 
print *, tim2-tim1 
END PROGRAM 

The measured results in seconds are given in the table below. The column 
labeled with np shows the number of processors. The column labeled Tkind gives 
the kind of template the array was aligned with. The alignment itself is given 
in the second row. For each of these alignments, three columns are given for the 
PRE-HPF compiler, labeled (bc_col, bc_row, bl/cy). The column labeled bc_col 
gives the timings of a column-wise enumeration with a double loop. The column 
labeled bc_row gives the timings of a row-wise enumeration with a double loop. 
The third column labeled with bl/cy gives the timings of a block enumeration 
when the template is block distributed and the timings of a cyclic enumeration 
when the template is cyclic distributed, respectively. The columns labeled with 
pgi and gmd are the timings of the PGI-HPF and GMD-HPF generated codes, 
respectively. The blank table entries denote that the program did not compile 
or crashed at run-time. The gray table entries denote the best timings. 

We will first concentrate on the timings of the PRE-HPF compiler. When 
looking at the results, many columns are the same. For instance, all timings of the 
direct alignment ‘Align A(i) with T(i)’ are almost the same as the timings of the 
shifted alignment ‘Align A(i) with T(i+2)’, as expected. The strided alignment 
‘Align A(i) with T(3i+7)’ however gives totally different timings for the white 
columns of the block-like, cyclic-like, and cyclic distributions. This is a result of 
the fact that the stride is three and, in the worst case, the compiler might need to 
check three times as many rows or columns when enumerating bc_col or bc_row. 
Whether the compiler knows if a distribution is block or cyclic does not matter 
for the performance, as long as the compiler enumerates bc_row for a block 
distribution and bc_col for a cyclic distribution. In fact, this is no surprise because 
the generated outer loop will only loop once and hence does not generate extra 
loop overhead. On the other hand, when the inner loop would only loop once, 
the worst-case loop overhead will be measured, as shown in the columns marked 
with the ellipses. In the PRE-HPF compiler, row-wise storage compression is 
used followed by a column-wise compression (tile-wise compression). 

The generated code for cyclic(m) enumeration allows a selection of bc_col or 
bc_row enumeration at run-time. The run-time system then always selects the 
largest loop as the inner loop. This yields the best performance, independent of 
the alignment and the distribution, except for the distribution block-like. For 
the block-like distribution, the best timings are shared between the PGI-HPF 
compiler and the PRE-HPF compiler. For a few processors, the PGI-HPF 
compiler performs best, because the local elements are stored row-wise and therefore 
the efficient stride-one memory access occurs. The PRE-HPF compiler timings 
correspond to a less efficient strided memory access, because it uses a fixed 
column-wise storage of local elements. If we had selected a row-wise 
storage for the local elements, the timings of the (bc_row, block-like) columns 
could be swapped with the timings of the (bc_col, cyclic-like) columns. The 
PRE-HPF compiler would then outperform the PGI-HPF compiler for all three alignments. 
Looking at the PGI-HPF timings, we conclude that it uses a fixed row-wise 
enumeration scheme and the local elements are stored row-wise. This results in the 
same timings for a block kind of template as for the PRE-HPF compiler. 
Better timings are observed for the block-like templates, due to the row-wise local 
storage. If the PRE-HPF compiler also selects a row-wise storage, the above 
mentioned columns can be swapped and it then outperforms the PGI-HPF compiler 
by a factor between 1.5 and 4. Slow execution times occur for templates where 
the PGI-HPF compiler should have switched over to column-wise enumeration, 
like in the cyclic-like templates. The PGI-HPF timings of a cyclic template are 
strongly dependent on the alignment used. They vary from twice as large for the 
direct alignment up to 40 times as large for the strided alignment. 
Abstract. Writing correct and efficient programs for parallel computers 
remains a challenging task, even after some decades of research in 
this area. One way to generate parallel programs is to write sequential 
programs and let the compiler handle the details of extracting paral- 
lelism. LooPo is an automatic parallelizer that extracts parallelism from 
sequential loop nests by transformations in the polyhedron model. The 
generation of code from these transformed programs is an important 
step. We report on problems met during code generation for HPF, and 
existing methods that can be used to reduce some of these problems. 



1 Introduction 

Writing correct and efficient programs for parallel computers is still a challenging 
task, even after several decades of research in this area. Basically, there are 
two major approaches: one is to develop parallel programming paradigms and 
languages which try to simplify the development of parallel programs (e.g., 
data-parallel programming and HPF [Hig97]), the other is to hide all parallelism 
from the programmer and let an automatically parallelizing compiler do the job. 

Parallel programming paradigms have the advantage that they tend to come 
with a straightforward compilation strategy. Optimizations are mostly performed 
based on a textual analysis of the code. This approach can yield good results for 
appropriately written programs. Modern HPF compilers are also able to detect 
parallelism automatically based on their code analysis. 

Automatic parallelization, on the other hand, often uses an abstract mathe- 
matical model to represent operations and dependences between them. Transfor- 
mations are then done in that model. A crucial step is the generation of actual 
code from the abstract description. 

Because of its generality, we use the polyhedron model for 
parallelization. Parallel execution is then defined by an affine space-time mapping 
that assigns (virtual) processor and time coordinates to each iteration. Our goal 
is then to feed the resulting loop nest with explicit parallel directives to an HPF 
compiler. The problem here is that transformations in the polyhedron model can, 
in general, lead to code that cannot be handled efficiently by the HPF compiler. 
In the following section, we point to some key problems that occur during this 
phase. 
2 Problems and Solutions 

The first step in the generation of loop programs from a set of affine conditions 
is the scanning of the index space; several methods have been proposed for this 
task. However, the resulting program may contain 
array accesses that cannot be handled efficiently by an HPF compiler. 

This can be partly avoided by converting to single assignment (SA) form. 
This transformation [Coh99] is often used to increase the amount of 
concurrency that can be exploited by the compiler (since, in this form, only true 
dependences have to be preserved). Converting to SA form after loop skewing 
- which is proposed by Collard - yields relatively simple index 
functions: index functions on the left-hand side of an assignment are given by the 
surrounding loop indices, and index functions on the right-hand side (RHS) are 
simplified because uniform dependences lead to simple numerical offsets and, 
thus, to simple shifts that can be detected and handled well by HPF compilers. 
However, there are three points that cause new problems: 

1. SA form in its simple form is extremely memory-consuming. 

2. Conversion to SA form may lead to the introduction of so-called φ-functions 
that are used to reconstruct the flow of data. 

3. Array occurrences on the RHS of a statement may still be too complex for 
the HPF compiler in the case of non-uniform dependences, which may again 
lead to serialized load communications. 

The first point is addressed by Lefebvre and Feautrier. Basically, they 
introduce modulo operators in array subscripts that cut down the size of the 
array introduced by SA conversion to the length of the longest dependence for a 
given write access. The resulting arrays are then partially renamed, using a graph 
coloring algorithm with an interference relation (write accesses may conflict for 
the same read) as edge relation. Modulo operators are very hard to analyze, but 
introducing them for array dimensions that correspond to loops that enumerate 
time steps (in which the array is not distributed) may still work, while spatial 
array dimensions should remain without modulo operators. In the distributed 
memory setting, this optimization should generally not be applied directly, since 
this would result in some processors owning the data read and written by others. 
The overall memory consumption may be smaller than that of the original array 
but, on the other hand, buffers and communication statements for non-local data 
have to be introduced. One solution is to produce a tiled program and not use 
the modulo operator in distributed dimensions. 
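
The effect of a modulo operator on a time dimension can be illustrated with a small C++ sketch. It is not the algorithm of Lefebvre and Feautrier, only a hand-written instance of the idea on a hypothetical recurrence; the function names are assumptions, steps is assumed to be at least 1, and the window is chosen one cell larger than the longest dependence distance so that reads and writes cannot collide.

```cpp
#include <cassert>
#include <vector>

// Single-assignment form: every time step writes its own array cell.
long fib_sa(int steps) {
    std::vector<long> a(steps + 1);
    a[0] = 0; a[1] = 1;
    for (int t = 2; t <= steps; ++t) a[t] = a[t - 1] + a[t - 2];
    return a[steps];
}

// Contracted form: the longest dependence reaches back two steps, so
// three cells indexed modulo 3 are enough to keep every live value.
long fib_modulo(int steps) {
    long a[3] = {0, 1, 0};
    for (int t = 2; t <= steps; ++t) a[t % 3] = a[(t - 1) % 3] + a[(t - 2) % 3];
    return a[steps % 3];
}
```

Both versions compute the same values, but the second needs constant storage regardless of the number of time steps.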

φ-functions may be necessary in SA form due to several possible sources 
of a single read since, in SA form, each statement writes to a separate, newly 
introduced array. φ-functions select a specific source for a certain access; thus, 
their function is similar to the ?-operator of C. In the case of selections based on 
affine conditions, φ-functions can be implemented by copy operations executed 
for the corresponding part of the index, which can be scanned by standard 
methods. Yet, even this implementation introduces memory copies that can be 
avoided by generating code for the computation statement for some combinations 
of possible sources directly, trading code size for efficiency. 

For non-affine conditions, additional data structures become necessary to 
manage the information of which loop iteration performs the last write to a 
certain original array cell. Cohen offers a method for handling these accesses 
and also addresses optimization issues for shared memory; in the distributed 
memory setting, however, generation of efficient management code becomes more 
complicated, since the information has to be propagated to all processors. 

A general approach for communication generation that can also be used to 
vectorize messages for affine index expressions is described by Coelho: 
send and receive sets are computed, based on the meet of the data owned by 
a given sending processor with the read accesses of the other processors - all 
given by affine constraints. The corresponding array index set is then scanned 
to pack the data into a buffer; an approach that has also been taken by the 
dHPF compiler. In a first attempt to evaluate this generalized message 
vectorization using portable techniques, we implemented HPF_LOCAL routines for 
packing and sending data needed on a remote processor and copying local data 
to the corresponding array produced by single-assignment conversion. However, 
our first performance results with this compiler-independent approach were not 
encouraging due to very high overhead in loop control and memory copies. 

Communication generation may also be simplified by communicating a su- 
perset of the data needed. We are currently examining this option. Another 
point of improvement that we are currently considering is to recompute data 
locally instead of creating communication, if the cost of recomputing (and the 
communication for this recomputation) is smaller than the cost for the straight- 
forward communication. Of course, this scheme cannot be used to implement 
purely pipelined computation, but may be useful in a context where overlap- 
ping of computation with communication (see below) and/or communication of 
a superset of data can be used to improve overall performance. 

Overlapping of communication with computation is also an important opti- 
mization technique. Here, it may be useful to fill the temporaries that implement 
the sources of a read access directly after computing the corresponding value. 
Data transfers needed for these statements may then be done using non-blocking 
communication, and the operations, for which the computations at a given time 
step must wait, are given directly by an affine function. Although our preliminary 
tests did not yield positive results, we are still pursuing this technique. 

A further issue is the size and performance of the code generated by a 
polyhedron scan. Quilleré and Rajopadhye introduce a scanning method that 
separates the polyhedra to be scanned such that unnecessary IF statements 
inside a loop - which cause much run-time overhead - are completely removed. 
Although this method still yields very large code in the worst case, it allows one to 
trade between performance and code size by adjusting the dimension in which the 
code separation should start, similar to the Omega code generator. So, 
the question is: which separation depth should be used for which statements? A 
practical heuristic may be to separate the loops surrounding computation 
statements on the first level, scan the loops implementing φ-functions separately, and 
replicate these loop nests at the beginning of time loops. 



3 Conclusions 

We have learned that straightforward output of general skewed loop nests leads 
to very inefficient code. This code can be optimized by converting to SA form 
and leveraging elaborate scanning methods. Yet, both of these methods also have 
drawbacks that need to be weighed against their benefits. There is still room left 
for optimization by tuning the variable factors of these techniques. Code size and 
overhead due to complicated control structures have to be considered carefully. 
Abstract. CFL (Communication Fusion Library) is a C++ library for 
MPI programmers. It uses overloading to distinguish private variables 
from replicated, shared variables, and automatically introduces MPI 
communication to keep such replicated data consistent. This paper con- 
cerns a simple but surprisingly effective technique which improves per- 
formance substantially: CFL operators are executed lazily in order to 
expose opportunities for run-time, context-dependent, optimisation such 
as message aggregation and operator fusion. We evaluate the idea in the 
context of a large-scale simulation of oceanic plankton ecology. The re- 
sults demonstrate the software engineering benefits that accrue from the 
CFL abstraction and show that performance close to that of manually 
optimised code can be achieved automatically in many cases. 



1 Introduction 

In this paper we describe an experimental abstract data type for representing 
shared variables in SPMD-style MPI programs. The operators of the abstract 
data type have a simple and intuitive semantics and hide any required communi- 
cation from the user. Although there are some interesting issues in the design of 
the library, the main contribution of this paper is to show how lazy evaluation is 
used to expose run-time optimisations which may be difficult, or even 
impossible, to spot using conventional compile-time analyses. Figure 1 shows the basic 
idea. The CFL class library can be freely mixed with standard MPI operations 
in a SPMD application. C++ operator overloading is used to simplify the API 
by using existing operator symbols (e.g. +, *, += etc.) although these may have 
a parallel reading when the target of an assignment is another shared variable. 

Figure 2 shows how this code could be optimised manually by fusing the three 
reductions. This paper shows how our implementation automatically achieves 
this behaviour. 

Related work There has been a lot of interesting work using C++ to simplify 
parallel programming. The idea of delaying library function evaluation to create 
optimisation opportunities is fairly well-known (e.g. [2]). Compile-time 
techniques are applicable, but require interprocedural analysis. An alternative is for 
the programmer to build an execution plan explicitly. Weak consistency 
in an application-specific cache coherency protocol creates similar optimisation 
opportunities. 

double x, y, z, a, b, c, d, e, f, n; 
for(i=0;i<NO_ITER;i++) { 
  a = ...new value for a...; 
  MPI_Allreduce(&a, &d, 1, MPI_DOUBLE, 
                MPI_SUM, MPI_COMM_WORLD); 
  x += d; 
  b = ...new value for b...; 
  MPI_Allreduce(&b, &e, 1, MPI_DOUBLE, 
                MPI_SUM, MPI_COMM_WORLD); 
  y -= e; 
  c = ...new value for c...; 
  MPI_Allreduce(&c, &f, 1, MPI_DOUBLE, 
                MPI_SUM, MPI_COMM_WORLD); 
  z += f; 
  n += x + y + z; 
} 

CFL_Double x, y, z; 
double n; 
for(i=0;i<NO_ITER;i++) 
{ 
  x += ...new value for a...; 
  y -= ...new value for b...; 
  z += ...new value for c...; 
  n += x + y + z; 
} 

Fig. 1. In the MPI code (first listing), x, y and z are replicated and updated 
explicitly using MPI. Using CFL (second listing), replicated shared variables 
are declared as CFL_Double. Arithmetic and assignment operators are 
overloaded to implement any communication needed to keep each 
processor’s data up to date. 

Implementation Space precludes a full explanation of how CFL works. As 
computation proceeds, we accumulate deferred updates to each CFL_Double, together 
with a list of deferred communications. Execution is forced when a CFL_Double 
is assigned to a local variable, appears in a conditional, or is passed as a double 
parameter. Some care is needed to handle expression intermediate values 
properly. Note that the semantics of a statement like x=y+a depends crucially on 
which of x, y and a are doubles and which are CFL_Doubles. 
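
The following C++ sketch illustrates the general idea of deferring updates and forcing them on first use. It is not the CFL implementation: the class names LazyShared and Fusion are hypothetical, only += and -= are modelled, and the collective communication is simulated for a single process by a counter where a real version would issue one MPI_Allreduce on a packed buffer.

```cpp
#include <cassert>
#include <vector>

class LazyShared;

// Hypothetical stand-in for the communication layer: it only counts how many
// collective reductions would be issued. A real version would call
// MPI_Allreduce on the packed buffer inside flush().
struct Fusion {
    std::vector<LazyShared*> pending; // variables with deferred updates
    int reductions = 0;               // number of (simulated) collectives
    void flush();
};

class LazyShared {
public:
    explicit LazyShared(Fusion& f) : fusion_(f) {}
    // Updates are only recorded; no communication happens yet.
    LazyShared& operator+=(double local) { defer(local); return *this; }
    LazyShared& operator-=(double local) { defer(-local); return *this; }
    // Reading the value as a plain double forces all deferred work.
    operator double() { fusion_.flush(); return value_; }
private:
    friend struct Fusion;
    void defer(double delta) {
        if (!queued_) { fusion_.pending.push_back(this); queued_ = true; }
        deferred_ += delta;
    }
    Fusion& fusion_;
    double value_ = 0.0, deferred_ = 0.0;
    bool queued_ = false;
};

// One combined reduction services every pending variable.
void Fusion::flush() {
    if (pending.empty()) return;
    ++reductions; // single-process simulation: global sum equals local sum
    for (LazyShared* v : pending) {
        v->value_ += v->deferred_;
        v->deferred_ = 0.0;
        v->queued_ = false;
    }
    pending.clear();
}
```

Updating three such variables and then reading them in one expression triggers a single simulated reduction rather than three, which mirrors the fusion shown in the manually optimised code.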



2 Experimental Evaluation and Application Experience 

Our performance results are from dedicated runs on a Fujitsu AP3000, a 
distributed-memory MPP comprising 32 300MHz Sparc Ultra II processors each with 
128MB RAM, linked by a fast proprietary interconnect which is accessed directly 
by the vendor’s MPI implementation. 

We present two experiments to help evaluate the work. The first is the 
artificial application shown at the beginning of the paper. For the local calculations 
to compute a, b and c we chose trivial arithmetic operations, so the computation 
is entirely dominated by communication. The results are presented in Table 1. 
double x, y, z, sbuf[3], rbuf[3], n; 
for(i=0;i<NO_ITER;i++) { 
  sbuf[0] = ...new value for a...; 
  sbuf[1] = ...new value for b...; 
  sbuf[2] = ...new value for c...; 
  MPI_Allreduce(sbuf, rbuf, 3, MPI_DOUBLE, 
                MPI_SUM, MPI_COMM_WORLD); 
  x += rbuf[0]; 
  y -= rbuf[1]; 
  z += rbuf[2]; 
  n += x + y + z; 
} 

Fig. 2. Manually optimised implementation of example from Figure 1. The 
communications can be delayed, but have to occur in order to produce the 
correct value for the non-shared variable n. At that point, the three scalar 
MPI_Allreduce operations can be combined into a single MPI_Allreduce 
operating on a three-element vector. 

Table 1. With the contrived communication-bound example of Figures 1 
(“Unoptimised” and “Automatic”) and 2 (“Manual”), optimisation reduces 
three scalar MPI_Allreduce calls to one operating on a three-element vector. Due 
to run-time overheads, CFL does not quite realise a 3-fold speedup. With one 
processor, the CFL overheads dominate (the redundant MPI calls were still 
executed). 

         Execution time (s) 
Procs    Unoptimised   Manual   Automatic 
1        1.11          1.12     5.02 
2        49.9          15.0     17.9 
4        48.8          14.2     17.0 
8        37.9          11.3     12.5 
16       32.6          9.4      10.5 
32       20.0          6.0      6.4 

A full-scale application: Ocean Plankton Ecology The second experiment is a 
full-scale application which models plankton ecology in the upper ocean using 
the Lagrangian Ensemble method (the “ZB” configuration). 

Because of the modular nature of the application, environment variables such 
as concentrations of nutrients are assigned in one module, and frequently not 
used until a later module. The relatively large distance between the producer 
and consumer provides good scope for message aggregation. 

The results are shown in Table 2. With a larger 3.2M-particle problem with 
32 processors the CFL performance gain is around 12%, improving efficiency 
from 87% to 97%. 



Table 2. Ocean Plankton Ecology application (320,000 particles). Synchronisations
are reduced from 27 to 3 per time step with both manual and automatic
optimisation. CFL overheads are too small to measure (the small superlinear
speedup is due to cache effects, which probably also account for the good CFL
performance on 32 processors).

         Unoptimised       Manual            Automatic         CFL
Procs    secs   speedup    secs   speedup    secs   speedup    benefit
1        3721   1.00       3721   1.00       3738   0.995
2        1805   2.06       1779   2.09       1790   2.08       1%
4        934    3.98       869    4.28       866    4.30       8%
8        491    7.58       433    8.59       418    8.90       17%
16       317    11.74      244    15.25      257    14.48      23%
32       292    12.74      191    19.48      182    20.45      60.5%



3 Conclusions

This paper presents a simple idea, which works remarkably well. We have built a
small experimental library on top of MPI which enables shared scalar variables in
parallel SPMD-style programs to be represented as an abstract data type. Using
operator overloading, the familiar arithmetic operators can be used, although
the operators may have a parallel reading when the target of an assignment is
another shared variable. We have shown how delayed evaluation can be used to
aggregate the reduction messages needed to keep such variables consistent, with
potentially substantial performance benefits. This is demonstrated across the
modules of a large oceanography application. The approach avoids reliance on
sophisticated compile-time analyses and exploits opportunities which arise from
dynamic data dependencies. Using a contrived test program and a realistic case
study we have demonstrated very pleasing performance improvements.

Extending the library to support reductions of shared arrays should be
straightforward. Extending the idea to other communication patterns presents
interesting challenges which we are investigating.
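
The delayed-evaluation idea summarised above can be illustrated with a small, single-process sketch. It is not the CFL library: `SimulatedComm`, `LazyShared` and `run` are hypothetical names, MPI is replaced by a stand-in that only counts collective calls, and only additive updates are modelled. Under those assumptions it shows how three pending scalar reductions are combined into one vector reduction when a value is first read.

```python
from typing import Dict, List, Tuple


class SimulatedComm:
    """Stand-in for MPI (assumption): all ranks live in one process."""

    def __init__(self, nprocs: int) -> None:
        self.nprocs = nprocs
        self.allreduce_calls = 0

    def allreduce_sum(self, per_rank_vectors: List[List[float]]) -> List[float]:
        # One collective call, whatever the vector length.
        self.allreduce_calls += 1
        return [sum(column) for column in zip(*per_rank_vectors)]


class LazyShared:
    """Shared scalars whose pending updates are reduced together on first read."""

    def __init__(self, comm: SimulatedComm, names: List[str]) -> None:
        self.comm = comm
        self.value: Dict[str, float] = {name: 0.0 for name in names}
        # pending[name][rank] = local contribution that has not been reduced yet
        self.pending: Dict[str, List[float]] = {}

    def add(self, name: str, local_contributions: List[float]) -> None:
        # Delay communication: only record each rank's contribution.
        slot = self.pending.setdefault(name, [0.0] * self.comm.nprocs)
        for rank, contribution in enumerate(local_contributions):
            slot[rank] += contribution

    def read(self, name: str) -> float:
        # A read forces every pending reduction, packed into a single vector.
        if self.pending:
            names = list(self.pending)
            vectors = [[self.pending[n][r] for n in names]
                       for r in range(self.comm.nprocs)]
            totals = self.comm.allreduce_sum(vectors)
            for n, total in zip(names, totals):
                self.value[n] += total
            self.pending.clear()
        return self.value[name]


def run(iterations: int, nprocs: int) -> Tuple[float, int]:
    """Return the non-shared accumulator n and the number of collectives used."""
    comm = SimulatedComm(nprocs)
    shared = LazyShared(comm, ["x", "y", "z"])
    n = 0.0
    for _ in range(iterations):
        shared.add("x", [1.0] * nprocs)
        shared.add("y", [2.0] * nprocs)
        shared.add("z", [3.0] * nprocs)
        # Reading x forces communication; y and z travel in the same call.
        n += shared.read("x") + shared.read("y") + shared.read("z")
    return n, comm.allreduce_calls
```

An eager implementation would issue three collectives per iteration; here each iteration issues one, because the updates are only communicated when the non-shared variable needs them.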
Abstract. We consider a generalization of the SPMD programming
model for distributed memory machines based on orthogonal processor 
groups. In this model different partitions of the processors into disjoint 
processor groups exist and can be used simultaneously in a single parallel 
implementation. Set operations on orthogonal groups are used to express 
group-SPMD computations on different partitions of the processors. The 
set operations are implemented in MPI. 



1 Introduction 

Many applications from scientific computing exhibit a two-dimensional compu- 
tational structure. Examples are matrix computations or grid-based computa- 
tions. A typical approach for parallel execution on a distributed memory machine 
(DMM) is to distribute the data and let each processor perform the computations 
for its local data. Because of the importance of obtaining efficient programs for 
grid-based computations, there are many approaches to support the development 
of efficient programs for novel architectures; see [2] for an overview. Reducing the
communication overhead is one of the main concerns when developing programs
for DMMs. Grid-based computations with strong locality, such that a computation
at a grid point needs only data from neighboring grid points, are best
implemented with a blockwise data distribution resulting in point-to-point
communication between neighboring processors. But many grid-based algorithms
exhibit a more diverse dependence pattern, and the communication needed to
realize the dependencies is affected strongly by the data distribution. To minimize
the communication overhead, highly connected computations should reside on
the same processor.

We consider applications with a two-dimensional structure that exhibits de- 
pendencies in both the vertical and horizontal directions, in the sense that parts 
of the computation are column-oriented with similar computations and depen- 
dencies within the same column, while other parts of the computation are row- 
oriented. To realize an efficient implementation for pure horizontal or pure verti- 
cal computations, corresponding partitions of processors into disjoint groups are 
needed. The advantage of using group operations on disjoint processor groups is 
that collective communication operations performed on smaller processor groups 
lead to smaller execution times due to the logarithmic or linear dependence of
the communication times on the number of processors. For computations on
changing or alternating directions, different partitions of the processors have to 
be used for different directions. Correspondingly, it is useful to arrange the pro- 
cessors in orthogonal partition structures so that each processor belongs to a 
different group for each of the partitions. Since different directions of compu- 
tation may be active at different points during the execution, it is suitable to 
use a data distribution that is not biased towards a specific direction. Examples 
are blockwise data distributions or data distributions that are cyclic in each 
direction. 

In this paper, we extend a double-cyclic data distribution to a processor grid 
with orthogonal processor groups, which means that for one set of processors ex- 
ecuting the program there simultaneously exist two different partitions into dis- 
joint groups of processors. This leads to programs in a group-SPMD computation 
model with different groups in different parts of the program. The concurrently 
executed SPMD-code includes collective communication within each group of a 
partition so that data transfers can be combined into one communication oper- 
ation. The processor group concept reduces the number of processors that have 
to participate in a collective communication operation, resulting in faster execu- 
tion. Our approach is not simply to construct new groups dynamically. Instead, 
we define a fixed set of useful orthogonal groups at the outset. As the execution 
proceeds, the program establishes one grouping, performs SPMD operations and 
collective communications, and then establishes the other grouping for the next 
step. Disjoint processor groups can be expressed in the communication library 
MPI. In addition, we use set operations for the concurrently executing processor 
partitions, which are also implemented in MPI. As an example, we consider an 
LU decomposition for which a double-cyclic distribution leads to a good load 
balance HE] This approach can also be used when combining task and data 
parallel computations Q- 

2 Describing Orthogonal Computation Structures

We consider application programs with an orthogonal internal computation
structure. For the organization of the computations and the assignment to
processors, we use the following abstraction: A parallel application program is
composed of a set $\mathcal{T}$ of $n$ one-processor tasks that are organized in a
two-dimensional structure. The tasks are numbered with two-dimensional indices
in the form $T_{ij}$, $i = 1, \ldots, n_1$, $j = 1, \ldots, n_2$, with $n = n_1 \cdot n_2$.
Each single task $T \in \mathcal{T}$ consists of a sequence of computations which may
need data from other tasks or which may produce data needed by other tasks.

The entire task program is expressed in an SPMD style with explicit constructs
for horizontal or vertical executions, which we call horizontal sections and
vertical sections respectively. In horizontal sections a task $T_{ij}$ has interactions
with a set of tasks $\{T_{ij'} \mid j' = 1, \ldots, n_2,\ j' \neq j\}$. In vertical sections a
task $T_{ij}$ has interactions with a set of tasks
$\{T_{i'j} \mid i' = 1, \ldots, n_1,\ i' \neq i\}$. A single task $T_{ij}$ is composed of
communication and computation commands. To indicate that a task participates
in an SPMD-like operation together with other tasks in the horizontal or
vertical direction, respectively, we introduce specific commands with the
following meaning:

- vertical_section(k) { statements }:

  Each task in column k executes statements in an SPMD-like way together
  with the other tasks in column k; statements may contain computations as
  well as collective communication and reduction operations involving tasks
  $T_{1k}, \ldots, T_{n_1 k}$. Tasks $T_{ij}$ with $j \neq k$ perform a skip operation.

- horizontal_section(k) { statements }:

  Similar to vertical_section(k), but uses a horizontal organization.

- vertical_section() { statements }:

  Each task executes statements in an SPMD-like way together with the other
  tasks in the same column. A task in column k may perform computations
  as well as collective communication and reduction operations involving the
  other tasks $T_{1k}, \ldots, T_{n_1 k}$ in the same column. Computations in a specific
  column of the task array are executed in parallel with the other columns.
  Thus, vertical_section() corresponds to a parallel loop over all columns k
  with each iteration executing vertical_section(k).

- horizontal_section() { statements }:

  Similar to vertical_section(), but uses a horizontal organization.
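
The interaction sets behind these commands can be made concrete with a short sketch. The function names below are hypothetical and the code only models which tasks take part in a section; it does not model the statements themselves. Tasks are represented as 1-based index pairs (i, j), matching the numbering used above.

```python
from typing import List, Set, Tuple

Task = Tuple[int, int]  # (i, j) with 1 <= i <= n1 and 1 <= j <= n2


def horizontal_partners(i: int, j: int, n2: int) -> Set[Task]:
    """Tasks T_{ij'} in the same row i with j' != j."""
    return {(i, jp) for jp in range(1, n2 + 1) if jp != j}


def vertical_partners(i: int, j: int, n1: int) -> Set[Task]:
    """Tasks T_{i'j} in the same column j with i' != i."""
    return {(ip, j) for ip in range(1, n1 + 1) if ip != i}


def vertical_section(k: int, n1: int, n2: int) -> List[Task]:
    """Tasks that execute the statements of vertical_section(k); all others skip."""
    return [(i, j) for i in range(1, n1 + 1) for j in range(1, n2 + 1) if j == k]
```

For a 3 x 4 task array, `vertical_section(2, 3, 4)` selects the three tasks of column 2, and every other task performs the skip operation.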

3 Mapping to Orthogonal Processor Groups 

In order to exploit the potential parallelism of orthogonal computation struc- 
tures, we map the computations onto a two-dimensional processor grid for which 
we provide two different partitions that exist simultaneously and can be exploited 
to reduce the communication overhead. For the assignment of tasks to proces- 
sors, we use parameterized mappings similar to parameterized data distributions 
w which describe the data distribution for arrays of arbitrary dimension. This 
mechanism is used to define disjoint row and column groups such that each row 
and column of the task array is assigned to a single row and column group
respectively. For the parameterized mapping, the processors are logically arranged
in a two-dimensional processor grid that can be described by the numbers $p_1$
and $p_2$ of processors in each dimension. A double-cyclic mapping of the
two-dimensional task array to the processor grid is specified by blocksizes $b_1$
and $b_2$ in each dimension, which determine the number of consecutive rows and
columns that each processor obtains of each cyclic block. The row groups and the
column groups are orthogonal processor groups.
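
A minimal sketch of such a double-cyclic mapping follows. It assumes the usual block-cyclic formula and 0-based indices, neither of which is spelled out in the text; `owner` and `global_rank` are hypothetical helper names, and the row-major rank numbering is likewise an assumption.

```python
from typing import Tuple


def owner(i: int, j: int, p1: int, p2: int, b1: int, b2: int) -> Tuple[int, int]:
    """Grid coordinates (r, c) of the processor that owns task (i, j), 0-based.

    Blocks of b1 consecutive rows (b2 consecutive columns) are dealt out
    cyclically over the p1 grid rows (p2 grid columns).
    """
    return ((i // b1) % p1, (j // b2) % p2)


def global_rank(r: int, c: int, p2: int) -> int:
    """Row-major global rank of grid position (r, c) (an assumed numbering)."""
    return r * p2 + c
```

With blocksizes of 1 the mapping is purely cyclic in both dimensions; larger blocksizes keep neighbouring rows and columns on the same processor.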

The mapping of the tasks to the processors defines the computations that
each processor has to perform. Horizontal and vertical sections require the
coordination of the participating processors. A vertical_section(k) operation is
performed by all processors in the corresponding column group. Similarly, a
horizontal_section(k) operation is performed by all processors in the corresponding
row group. A vertical_section() operation is performed by all processors, but each
processor exchanges information only with the processors in the same column
group. A horizontal_section() is executed analogously.

The realization is performed by using the communicator or group concepts of 
MPI. Since the parallel computation on the orthogonal groups is often embedded 
in a larger application, it is convenient to use the global ranks of the processors 
to direct the computations. Set operations are used to identify members of a 
processor group responsible for a specific row or column of the task array or to 
pick processors in the intersection of specific row or column groups. 
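
The set operations on groups of global ranks can be sketched without MPI as follows. This is an illustration only: the helper names are hypothetical, ranks are assumed to be numbered row-major over the processor grid, and indices are 0-based. In an MPI program the same sets would be held as groups or communicators.

```python
from typing import FrozenSet, List


def row_groups(p1: int, p2: int) -> List[FrozenSet[int]]:
    """One group of global ranks per grid row (row-major ranks assumed)."""
    return [frozenset(r * p2 + c for c in range(p2)) for r in range(p1)]


def column_groups(p1: int, p2: int) -> List[FrozenSet[int]]:
    """One group of global ranks per grid column."""
    return [frozenset(r * p2 + c for r in range(p1)) for c in range(p2)]


def responsible_column_group(k: int, p1: int, p2: int, b2: int) -> FrozenSet[int]:
    """Column group holding column k of the task array (block-cyclic, 0-based)."""
    return column_groups(p1, p2)[(k // b2) % p2]
```

Because the two partitions are orthogonal, the intersection of any row group with any column group contains exactly one processor, which is how a single processor in a given row and column can be picked.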

4 Example 

Figure 1 shows runtime results for a double-cyclic LU decomposition with
column pivoting on a Cray T3E-1200. The blocksize in each dimension has been set
to 1. The diagrams compare two versions which are based on a data distribution
on the global group of processors with an implementation resulting from the use
of orthogonal groups. These groups are used for computing the pivot element on
a single column group, for making the pivot row available to all processors by
parallel executions on all column groups, for computing the elimination factors
on a single column group, and for broadcasting the elimination factors in the row
groups. The processor grid uses an equal number of processors in each
dimension, i.e., the row and column groups have equal size. For 24 and 72 processors,
processor grids of size 4x6 and 8x9, respectively, are used. All versions result in
a good load balance. The diagrams show that for a larger number of processors,
the implementation with orthogonal processor groups shows the best runtime
results. For the largest number of processors, the percentage improvement over
the second best version is 25% and 15% respectively.



[Figure: two runtime plots, "LU decomposition for n = 2304 on Cray T3E" and
"LU decomposition for n = 3600 on Cray T3E"; horizontal axis: processors.]

Fig. 1. Runtimes for the LU decomposition on a Cray T3E-1200.



Abstract. This work presents compiler-based scheduling strategies for
Java Mobile Agents. We analyze the program using annotations and 
data sizes. For the different strategies, the compiler produces the best 
schedule, taking dependence information into account. 



1 Introduction 

Java mobile agents are useful in distributed data mining applications. This work
attempts to perform compiler-based optimizations of Java mobile agents. Our
compiler (called ajava) takes an annotated Java program as input and performs
various optimizations. It identifies a schedule that carries the minimum amount of
data through the network and generates mobile agents.



2 Motivating Example 

Distributed data mining applications can take advantage of mobile agents. The
following example accesses data distributed across three clients in the local
network. The database tables used are Loans (5 records of 20 bytes each),
Employee (5000 records of 40 bytes each) and LoansAvailed (7000 records of 12
bytes each). In the example given in Figure 1, under the constraints of flow
dependence, statements S1, S2 and S3 can be interchanged, but S4 can occur only
after S1, S2 and S3. Therefore the possible legal schedules are

(S1,S2,S3,S4), (S1,S3,S2,S4), (S2,S1,S3,S4),

(S2,S3,S1,S4), (S3,S1,S2,S4), (S3,S2,S1,S4)

Among the six schedules, only (S2,S3,S1,S4) carries the least amount of data.
The amount of data moved around by this schedule is given below.

Total size of Employee table = 5000 * 40 = 200000 bytes
Total size of Loans table = 5 * 20 = 100 bytes
Total size of LoansAvailed table = 7000 * 12 = 84000 bytes
Total data carried = 100 + 84100 + 284100 = 368300 bytes
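
The arithmetic above can be checked by enumerating the six schedules. The sketch below is an illustration, not the ajava compiler: it assumes the agent leaves the source empty, picks up one whole table per statement, carries everything collected so far on each following hop (including the trip back to the source), and runs S4 at the source. The names `SIZES`, `data_carried` and `ranked_schedules` are hypothetical.

```python
from itertools import permutations
from typing import Dict, List, Tuple

# Table sizes in bytes, taken from the example above.
SIZES: Dict[str, int] = {
    "S1": 5000 * 40,   # Employee
    "S2": 5 * 20,      # Loans
    "S3": 7000 * 12,   # LoansAvailed
}


def data_carried(order: Tuple[str, ...]) -> int:
    """Total bytes moved over all hops that follow a fetch."""
    carried = 0
    total = 0
    for stmt in order:
        carried += SIZES[stmt]   # pick up this table at its host
        total += carried         # carry everything collected so far on the next hop
    return total


def ranked_schedules() -> List[Tuple[int, Tuple[str, ...]]]:
    """All legal schedules (S4 last), cheapest first."""
    return sorted((data_carried(p), p + ("S4",)) for p in permutations(SIZES))
```

Under these assumptions the cheapest schedule visits the tables in order of increasing size, which reproduces the 100 + 84100 + 284100 bytes computed for (S2,S3,S1,S4).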



* This work was partially supported by NSF grant #EIA 987135

//& (Employee, atp://stationB.ececs.uc.edu, 5000);
//& (Loans, atp://stationA.ececs.uc.edu, 5);
//& (LA, atp://stationC.ececs.uc.edu, 7000);

S1  Employee = getEmployeeRecords();
S2  Loans = getLoansRecords();
S3  LA = getLoansAvailedRecords();
S4  resultSet = getResultSet(Employee, LA, "salary > 10000
        and loan_no = 4");



Fig. 1. Scheduling of Mobile Agents

3 Framework for Efficient Schedule Generation 

Our compiler framework uses annotations in the Java programs, which identify
the distributed data variables and the host from which the data for them can
be extracted. The annotation also specifies the approximate size of the data that
the variable would be carrying. The syntax of the annotation is

//& (Variable Name, Host Name, Approximate Size);

Carrying Data Strategy (CDS): In this strategy, the mobile agent carries
the data around and, after getting all the necessary data, operates on them
using the code at the source host. This strategy is applied if the following is true:
$|f(arg_1, arg_2, \ldots, arg_n)| > |arg_1| + |arg_2| + \ldots + |arg_n|$,
where $|x|$ indicates the data size.

Let us consider an example that multiplies two polynomials that are
distributed (Figure 2). The sizes of the input polynomials, as seen from the
pseudo-code, are 100 and 100. This implies that the resultant product will have a
maximum size of 10000 coefficients. Compared with the size of the input
(100 + 100 = 200), the size of the result is enormous. In this case, the best
approach is to carry the data to the source and then compute the product at the
source, so that the amount of data carried will be only 200.
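
The size test that selects the Carrying Data Strategy can be written as a one-line rule. The sketch below is illustrative only: `choose_strategy` is a hypothetical name, sizes are plain integers in any consistent unit, and falling back to the Partial Evaluation Strategy when the condition does not hold (including the case of equality) is an assumption.

```python
from typing import Sequence


def choose_strategy(result_size: int, arg_sizes: Sequence[int]) -> str:
    """Return "CDS" when the result outweighs the inputs, otherwise "PES"."""
    return "CDS" if result_size > sum(arg_sizes) else "PES"
```

For the polynomial example, `choose_strategy(10000, [100, 100])` selects CDS, since the product is far larger than the two inputs together.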

Partial Evaluation Strategy (PES): In this strategy, the mobile agent uses
the code to operate on the data that is available at a remote host and then
carries only the result from there. This strategy is better when the data operated
on is enormous or if it cannot be carried around due to security or proprietary
reasons. In Figure 3, the query that was evaluated was "the list of employees with
salary > 15000 and who have availed loan 5". The result set of this query is small
after the join, but the result set of "the list of employees with salary > 15000"
alone is large. Therefore, the Carrying Data Strategy will carry unnecessary data
through the network before it finds that the result set is small. Since S3 depends
on S2 for getting the value of the loan number, it cannot be executed before S2.
Therefore, the available schedules for this example are

(S1,S2,S3,S4), (S2,S1,S3,S4), (S2,S3,S1,S4)

The data carried by the Carrying Data Strategy for the three different schedules
are 36040 bytes, 32060 bytes and 52060 bytes. However, when the Partial Evaluation
Strategy is considered, the sizes of data moved around are 10020 bytes, 6040
bytes and 26040 bytes respectively. This shows that the Partial Evaluation Strategy
is better if the size of the expected result is smaller than the input.



//& (P[1], machine1.ececs.uc.edu, 100);
//& (P[2], machine2.ececs.uc.edu, 100);

S1  machine[1] = machine1.ececs.uc.edu;
S2  machine[2] = machine2.ececs.uc.edu;
    // getPoly returns the array after storing the coefficients
S3  for (i = 1 to 2) do
S4      P[i] = getPoly(machine[i]);
    // The function multiplyPolynomial multiplies the polynomials
S5  Product = multiplyPolynomial(P[1], P[2]);



Fig. 2. Carrying Data Strategy



//& (Employee, atp://stationB.ececs.uc.edu, 1000);
//& (Loans, atp://stationA.ececs.uc.edu, 5);
//& (LA, atp://stationC.ececs.uc.edu, 2000);

S1  Employee = getEmployeeRecords(salary < 1000);
S2  Loans = getLoansRecords(name = "Computer Loan");
S3  LA = getLoansAvailedRecords(loan = Loans.loan_no);
S4  resultSet = getResultSet(Employee, LA, "salary < 1000
        and loan_no = loan");



Fig. 3. Partial Evaluation Strategy
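
The three schedules listed for the Partial Evaluation example follow mechanically from the dependences. The sketch below enumerates all orderings and keeps those that respect them; `legal_schedules` and `DEPENDENCES` are hypothetical names, and the dependence set is read off the text (S3 needs the loan number from S2, and S4 needs S1, S2 and S3).

```python
from itertools import permutations
from typing import List, Set, Tuple


def legal_schedules(statements: List[str],
                    depends_on: Set[Tuple[str, str]]) -> List[Tuple[str, ...]]:
    """All orderings in which every (later, earlier) dependence is respected."""
    legal = []
    for order in permutations(statements):
        position = {stmt: idx for idx, stmt in enumerate(order)}
        if all(position[earlier] < position[later] for later, earlier in depends_on):
            legal.append(order)
    return legal


# Each pair is (dependent statement, statement it depends on).
DEPENDENCES = {("S3", "S2"), ("S4", "S1"), ("S4", "S2"), ("S4", "S3")}
```

Brute-force enumeration is adequate for a handful of statements; a compiler handling longer programs would more likely walk the dependence graph directly.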



4 Results 

Our framework was implemented using Jikes, the Java compiler from IBM, and
the Aglets Software Development Kit from IBM Japan. In the database
examples, we have assumed that the agent can carry the result of a query but cannot
carry the entire database and make it persistent in some other location. We used
a local network consisting of 3 Sun Ultra 5 machines and a Sun Ultra 1 machine
for testing the framework. For each example, we made the compiler generate the
best schedule for both the CDS and PES strategies and compared the results.
In Tables 2 and 3, S refers to the Source, and M1, M2 and M3 are three
different machines where the data is distributed. Columns S-M1 etc. denote the
size of data (including the size of agent code) carried from S to M1, in kilobytes.



Table 1. Result set size (number of records) of the individual queries and the
join

S.No   Query             Result Size   Join Size
1      salary > 4000     4362          1540
       loanid = 4        1763
2      salary > 15000    2665          2
       loanid = 5        3
3      salary < 100      13            3
       loanid = 2        1783

Table 1 gives the size of the result set for various queries for a version of
the example given in Figure 3. Table 2 gives the results for CDS and PES for
various queries. From the table it is evident that the time taken by PES is
less than that of CDS when the result size is far smaller than the input data size.
Table 3 illustrates the results of the CDS and PES strategies for the example given
in Figure 2. It is evident from the table that when the result size is large, CDS
outperforms PES.



Table 2. Comparison of CDS and PES for the Query Example

        Carrying Data Strategy                     Partial Evaluation Strategy
Query   S-M1   M1-M2   M2-M3   M3-S   Time        S-M1   M1-M2   M2-M3   M3-S   Time
        (KB)   (KB)    (KB)    (KB)   (sec)       (KB)   (KB)    (KB)    (KB)   (sec)
1       1.3    2.3     239.5   277    80          1.2    2.2     239.4   73.9   51
2       1.3    2.3     149.3   149    52          1.2    2.2     149.2   2.3    23
3       1.3    2.3     3.2     41     45          1.2    2.2     3.0     2.3    18

Table 3. Comparison of CDS and PES for Polynomial Multiplication

Polynomial Degree      Carrying Data Strategy                     Partial Evaluation Strategy
I     II     Result    S-M1   M1-S   M2-S   Total Size   Time     S-M1   M1-M2   M2-S    Total   Time
                       (KB)   (KB)   (KB)   (KB)         (sec)    (KB)   (KB)    (KB)    Size    (sec)
10    10     100       2.4    2.6    2.6    10.0         2.3      2.9    3.1     4.1     10.2    4.5
10    1000   10000     2.4    2.7    12.7   20.3         2.7      2.9    3.1     103.3   109.4   7.2
50    1000   50000     2.4    3.1    12.7   20.7         2.9      2.9    3.5     534.9   541.5   42.3



5 Related Work and Conclusion

Traveler 0 and StratOSphere 0j support the creation of mobile agents and allow 
distributed processing across several hosts. Jaspal Subhlok et al. [3] present a
solution for automatic selection of nodes for executing a performance-critical
parallel application in a shared network. A. Iqbal et al. ^ find the shortest path
between the start node and the end node to get optimal migration sequence. This 
work is oriented towards reducing the distance traveled rather than minimizing 
the data carried. 

In contrast, this work shows how compiler-based optimizations help
achieve efficient scheduling of mobile agents. Compiler-based analysis helps
in reducing the amount of data carried through the network by identifying those
statements that can be partially evaluated and those statements for which data
can be carried. In addition, it can generate the most efficient schedule under the
program dependence constraints.



Abstract. We are interested in bytecode transformation for improving
program performance. In this work, we focus on the ability of our
bytecode-to-bytecode optimizing system to optimize performance on
hardware stack machines. Two categories of the problem are considered.
First, we consider stack allocation for intra-procedural cases with a
family of Java processors. We propose a mechanism that reports an
allocation scheme for a given stack allocation size according to our cost
model. Second, we extend our framework for stack allocation to deal with
inter-procedural cases. Our initial experimental test-bed is based on an
ITRI-made Java processor and the Kaffe VM simulator [2]. Early
experiments indicate that our proposed methods are promising for speeding
up Java programs on Java processors with a fixed-size stack cache.



1 Introduction 

In our research, we are interested in investigating issues related to improving
system performance via bytecode engineering. We are developing a
bytecode-to-bytecode optimizing system called JavaO. In this paper, we address
the ability of this system to optimize performance on hardware stack machines.
Our system takes a bytecode program and returns an optimized bytecode program
that runs well on hardware stack machines. We consider stack frame allocation for
Java processors with a fixed-size stack cache. Our work gives solutions for both
intra-procedural and inter-procedural cases.



2 Machine Architectures 

Our hardware model is basically a direct implementation of the frame activation
allocations of a software JVM. In order to perform a direct hardware
implementation of the frame allocation and reference schemes, our hardware
Java processor contains a stack cache, on which all frame structures are created
and destroyed. The ITRI Java processor 0) is an example of this type of machine.
An implementation that makes the JVM a hardware machine, but with a fixed-size
stack cache, will likely fall into this category. In addition, a Java processor
with a limited-size stack cache must have mechanisms such as spill and fill to
deal with stack overflow and underflow.

* This paper is supported in part by NSC of Taiwan under grant no. NSC 89-2213-E-007-049,
by MOE/NSC program for promoting academic excellence of universities
under grant no. 89-E-EA0414, and by ITRI of Taiwan under grant no. G388029.

S.P. Midkiff et al. (Eds.): LCPC 2000, LNCS 2017, pp. 377-023 2001.

(c) Springer-Verlag Berlin Heidelberg 2001

378 Jian-Zhi Wu and Jenq Kuen Lee



3 Reverse Object-Unfolding 

We propose a concept called reverse object-unfolding for stack cache
optimization. The method has the reverse effect of a well-known optimization
scheme called structure or object unfolding. Our consideration is that we need
to combine these two techniques to obtain an effective allocation scheme for a
fixed-size stack cache. Traditionally, a structure or object unfolding technique
can be used to transform heap accesses into stack accesses. For a Java processor
with a fixed-size stack, however, unlimited use of structure unfolding will result
in the size of the local variables exceeding the stack size, which reduces the
performance gains. To solve this problem, we propose to first perform object
unfolding, then assign heuristic weights to local variables, and finally perform
reverse object-unfolding to fold a certain number of local variables with the
least weights back into a heap object. In this way, we can improve performance
on stack machines.
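
The selection step of reverse object-unfolding can be sketched as a simple ranking. This is an illustration under stated assumptions, not the JavaO implementation: `reverse_unfold` is a hypothetical name, every local variable is assumed to occupy one stack slot, and the weights are taken as given (the cost model that produces them is not reproduced here).

```python
from typing import Dict, List, Tuple


def reverse_unfold(weights: Dict[str, float],
                   stack_slots: int) -> Tuple[List[str], List[str]]:
    """Split locals into (kept on the stack, reversed into a heap object)."""
    # Highest weight first; ties are broken by name so the result is deterministic.
    ranked = sorted(weights, key=lambda name: (-weights[name], name))
    return ranked[:stack_slots], ranked[stack_slots:]
```

With weights `{"a": 9.0, "b": 1.0, "c": 5.0, "d": 0.5}` and two available slots, `a` and `c` stay on the stack while `b` and `d` are folded back into the heap object.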



4 Inter-procedural Stack Cache Optimization 

We also extend our work to inter-procedural cases. We model this problem
as a set of equations and propose a heuristic algorithm to solve the stack
allocation problem for inter-procedural cases. In our heuristic algorithm, we use
a domain decomposition approach to partition the call graph into intervals and
then find the solution for each small interval of the call graph. We can then find
the best solution in each partition. In addition, we use a profiling scheme to
obtain the call-graph information in our design. Our goal is to minimize the total
of the reference costs of the variables spilled to memory and the stack flush
costs, so we need to deal with this problem under an assumed scheme for stack
flushes. In the following, we assume that the basic unit for each stack flush is
the stack allocation of one method. When a new method is invoked, the system
checks whether adding the stack allocation of the new method (which is the sum
of the activation record and the operand stack allocation) to the current
watermark of the stack cache will overflow the stack cache. If the new
watermark overflows the stack cache, a stack flush is done to spill elements
in the stack into memory. The spill is done by considering the stack allocation
of each method as a basic unit. The system will spill as many units as needed
to accommodate the new method invocation. Figure 1 gives the outline of our
heuristic algorithm.

Algorithm 1

Input: 1. The call graph G with n methods, $f_1, f_2, \ldots, f_n$.

       2. The block size W for program partitions.

Output: A heuristic scheduling for stack allocation.

Begin

Step 1: Use a domain decomposition approach to partition the call graph
of the program into block partitions.

Step 2: Find the solution for each small block of the call graph. This can
be done by solving the smaller problem for each sub-block, $Block_i$:

Minimize: $\left(\sum_{g_j \in Block_i} \psi(g_j, e_j)\right) + \phi(Block_i, \langle g_1, g_2, \ldots, g_n \rangle, \langle e_1, e_2, \ldots, e_n \rangle)$,
where $e_j$ is a possible stack allocation for $g_j$.

Step 3: Merge the assignments across different program blocks.

Step 4: Iterate back to Step 2 until the tunings no longer result
in performance improvements.

End



Fig. 1. A heuristic algorithm based on domain decomposition for
inter-procedural stack allocations



5 Experiments 

In the current state of our experiments, we have built the cycle time and cost
model estimation for an ITRI-made Java processor into a simulator based on
Kaffe VM [2] to estimate the cycle time with and without our optimization. The
ITRI-made Java processor belongs to the family of Java processors with a
fixed-size stack. We are in the process of implementing our bytecode optimizer,
called JavaO. As our bytecode transformation software is not yet able to perform
the optimizations of our proposed algorithms automatically, we report results
obtained by hand-coded transformations. The hand-coded transformations are
done according to our proposed algorithm, and the transformed code is then run
through the simulator to estimate the cycle time.

We look at “JGFArithBench.java” in the Java Grande benchmark suite
for the intra-procedural case. We contrast the performance effects for two
versions. In the base version, denoted “w/o Reorder”, the reverse object-unfolding
is done by converting the last few local variables in the original bytecode
program. In the second version, denoted “w/ Reorder”, the reverse
object-unfolding is done according to our algorithm in Section 3 to choose the
local variables for reverse unfolding. Figure 2 shows the intra-procedural results
of reversing different numbers of slots for the JGFrun method in the
JGFArithBench class. In this case, using our cost model for weight assignments
to local variables improves on the base version by approximately 59% in the
best case.

[Figure: bar chart comparing "w/o Reorder" and "w/ Reorder" for 2 to 28
reversed slots; the vertical axis runs from 1.0E+07 to 7.0E+07.]

Fig. 2. The intra-procedural results in reversing different number of slots for
JGFrun method in JGFArithBench class

Now let us consider the inter-procedural case. We illustrate the performance
effects with the following example. Suppose we have methods A, B, C, and D.
Method A calls B, then, after B returns, calls C, and then, after C returns,
calls D. We also suppose that the size of the stack cache is 64 and the size of the
frame state for a context switch is 6. In this example, we assume the stack frame
size for A is the same as that of the JGFrun method of the JGFArithBench class
(= 41), and the stack frame sizes for B, C, and D are all 24. In addition, we
assume the frequency of local variables in A is the same as in the JGFrun method.
In this case, the ITRI Java processor incurs 3 flushes and 3 spills. However, after
performing our heuristic algorithm for this program, we reverse 2 local variable
slots in A into heap accesses. In summary, using the stack flush scheme without
padding the procedure frame consumes an extra 1476 cycles for stack flushes,
while using our inter-procedural frame padding scheme from Section 4 consumes
only an extra 529 cycles. Table 1 illustrates this performance result. Our scheme
significantly reduces the stack flush overhead in this case.

Table 1. Comparison of stack flushes without our padding scheme and with our
padding scheme for inter-procedural cases.

Case                  Hardware    Heuristic
Stack Flushes         3 (in A)    0
Stack Spills          3 (in A)    0
Other Expense         0           reverse 2 slots in A
Total Cost (cycles)   1476        529
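
The flush counts in this example can be reproduced with a deliberately simplified simulation. The assumptions are mine rather than the paper's: frame sizes are treated as the full amount charged to the stack cache (the frame state of 6 is not added separately), the caller's frame is refilled after each callee returns, and a flush happens whenever the watermark plus the callee's frame exceeds the cache size. `count_flushes` is a hypothetical name, and the sketch does not model cycle costs.

```python
from typing import List


def count_flushes(caller_frame: int, callee_frames: List[int], cache_size: int) -> int:
    """Flushes incurred when one resident caller invokes each callee in turn."""
    flushes = 0
    for callee in callee_frames:
        watermark = caller_frame        # caller is resident again after each return
        if watermark + callee > cache_size:
            flushes += 1                # caller's frame is spilled as one unit
    return flushes
```

With A at 41 slots, each of the three calls overflows the 64-slot cache; with two slots of A reversed into heap accesses (39 slots), none does. The sketch does not explain why two slots rather than one were reversed, which presumably depends on details of the processor that are not given here.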

6 Conclusion 

In this paper, two categories of solutions were reported. We considered
stack allocation for intra-procedural and inter-procedural cases with a family
of Java processors. We believe our proposed schemes are important advances on
the issue of stack cache allocation on modern stack machines implementing the
JVM. Currently, we are in the process of implementing our software algorithm by
incorporating software into a package called JavaClass API. Early experiments
indicate that our proposed methods are promising for speeding up Java programs
on Java processors with a fixed-size stack cache.